psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

Reading URLs from <a> without needing to call pop() over absolute_links()? #390

Open victorhooi opened 4 years ago

victorhooi commented 4 years ago

If I have a table like so:

<table>
    <tr class="foo">
        <td>Lorem ipsum</td>
        <td><a href="www.google.com">Google</a></td>
        <td>USA</td>
    </tr>
    <tr class="foo">
        <td>Lorem ipsum</td>
        <td><a href="www.bing.com">Bing</a></td>
        <td>USA</td>
    </tr>
    <tr class="foo">
        <td>Lorem ipsum</td>
        <td><a href="www.yahoo.com">Yahoo</a></td>
        <td>USA</td>
    </tr>
</table>

I checked the requests-html API, and I can't find anything that reads the link out of an <a> tag directly - the only options seem to be the links and absolute_links properties.

absolute_links is a set, so to get a single value out of it you have to call pop() on it.

So for example, you have:

for row in r.html.find('tr.foo'):
    link = row.find('td')[1].absolute_links.pop()

However, I'm curious whether this is the correct way to do it with requests-html, or whether there's some built-in functionality I missed. Or would it make sense to add an easier way to extract links via requests-html?

samukweku commented 4 years ago

XPath is your friend here:

res = html.xpath("//table//a/@href")

print(res)

['www.google.com', 'www.bing.com', 'www.yahoo.com']