psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

HTML parsing abnormal #469

Open Alalalalaki opened 3 years ago

Alalalalaki commented 3 years ago

I recently ran `conda update --all` and found that HTML parsing in requests-html began to behave abnormally. In particular, the object returned by `html.find()` still contains the entire content of the page. For example, if `a = html.find("something", first=True)`, then `a.text` still shows all the text of the page.

I then created a clean environment containing only requests-html, and it works correctly there. So I suspect that a recently updated version of some other package in my main environment conflicts with HTML parsing in requests-html. However, I have no idea how this could happen or which package might be responsible.

Any suggestions would be appreciated.
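
Since the clean environment works and the main one does not, one way to narrow this down is to print the installed versions of the parsing-related packages in both environments and diff the output. The sketch below is only a starting point: the helper name `installed_versions` is hypothetical, and the package list is a guess at which dependencies might matter, not a confirmed list of culprits.

```python
from importlib import metadata

# Assumption: these are the packages most likely to affect HTML parsing.
# Adjust the list to match your own environment.
CANDIDATES = [
    "requests-html",
    "pyquery",
    "lxml",
    "cssselect",
    "beautifulsoup4",
    "w3lib",
    "parse",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one "name==version" line per package so two runs can be diffed.
    for name, version in installed_versions(CANDIDATES).items():
        print(f"{name}=={version}")
```

Running this in both the clean and the broken environment, then comparing the two listings, should show which versions differ.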

DanielPython2021 commented 3 years ago

I had the same problem. It is strange, because I have seen similar code run on YouTube with the expected results, but that is not my experience. To help, I am copying my code so you can reproduce the problem (these are cells from a Jupyter notebook). I also print the results from BeautifulSoup.

```
from requests_html import HTMLSession, HTML

doc = '<div class="class1">text1</div><div class="class2">text2</div><div class="class3">text3</div><div  class="class4">text4</div>'
html = HTML(html=doc)

for cl in ['class1', 'class2', 'class3', 'class4']:
    print(html.find('div.' + cl, first=True).html)
    print(html.find('div.' + cl, first=True).text)
    print('-' * 100)
```
```
text1
text2
text3
text4
text1 text2 text3 text4
----------------------------------------------------------------------------------------------------
text2
text3
text4
text2 text3 text4
----------------------------------------------------------------------------------------------------
text3
text4
text3 text4
----------------------------------------------------------------------------------------------------
text4
text4
----------------------------------------------------------------------------------------------------
```

```
for x in html.lxml:
    print(x.tag, x.attrib, x.text)
    print()
```

```
div {'class': 'class1'} text1

div {'class': 'class2'} text2

div {'class': 'class3'} text3

div {'class': 'class4'} text4
```

```
from bs4 import BeautifulSoup as bs

soup = bs(doc)
for cl in ['class1', 'class2', 'class3', 'class4']:
    print(soup.find('div', {'class': cl}))
    print(soup.find('div', {'class': cl}).text)
    print('-' * 80)
```
```
text1
text1
--------------------------------------------------------------------------------
text2
text2
--------------------------------------------------------------------------------
text3
text3
--------------------------------------------------------------------------------
text4
text4
--------------------------------------------------------------------------------
```

DanielPython2021 commented 3 years ago

The previous results were pasted incorrectly. The last results (from BeautifulSoup) should be as follows:

```
<div class="class1">text1</div>
text1
--------------------------------------------------------------------------------
<div class="class2">text2</div>
text2
--------------------------------------------------------------------------------
<div class="class3">text3</div>
text3
--------------------------------------------------------------------------------
<div class="class4">text4</div>
text4
--------------------------------------------------------------------------------
```
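
To check for this symptom without comparing output by eye, a small helper can flag when the text of one matched element also contains the text of its siblings. This is only a sketch: the name `leaked_texts` is hypothetical, and it works on plain strings, so you would pass it the `.text` of each `html.find(...)` result. The sample strings are the values reported in this thread.

```python
from typing import List


def leaked_texts(element_text: str, own_text: str, sibling_texts: List[str]) -> List[str]:
    """Return the sibling texts that show up in element_text besides own_text."""
    # Drop the element's own text once, then look for anything left over.
    remainder = element_text.replace(own_text, "", 1)
    return [text for text in sibling_texts if text in remainder]


siblings = ["text2", "text3", "text4"]

# Text reported for div.class1 in the broken environment.
print(leaked_texts("text1 text2 text3 text4", "text1", siblings))

# Text a correctly scoped div.class1 should have.
print(leaked_texts("text1", "text1", siblings))
```

A non-empty list means the element's text includes content from other elements, which is the behavior described above; an empty list means the match is scoped correctly.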