psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

Stray Closing Tag Causes Truncation In Find() #295

Open 3VikramS opened 5 years ago

3VikramS commented 5 years ago

<table>
<thead>
<tr>
<td><i>Heading 1</i></td>
<td><i>Heading 2</i></td>
<td>Heading 3</i></td>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1 Box 1</td>
<td>Row 1 Box 2</td>
<td>Row 1 Box 3</td>
</tr>
<tr>
<td>Row 2 Box 1</td>
<td>Row 2 Box 2</td>
<td>Row 2 Box 3</td>
</tr>
</tbody>
</table>

In the HTML above there is a stray closing </i> tag in the Heading 3 cell. Using the find() function to scrape this table captures the content only up to Heading 3, then immediately closes the still-open tags and finishes without raising any warning or error.

The scraped content is as follows:

<table>
<thead>
<tr>
<td><i>Heading 1</i></td>
<td><i>Heading 2</i></td>
<td>Heading 3</td></tr></thead></table>
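Until this is fixed, one way to guard against silent truncation is a tag-balance pre-check on the raw markup before handing it to find(). The sketch below uses the standard library's html.parser; StrayTagChecker is a hypothetical diagnostic helper, not part of requests-html:

```python
from html.parser import HTMLParser

# void elements never get a closing tag, so they must not go on the stack
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}

class StrayTagChecker(HTMLParser):
    """Record closing tags that have no matching open tag."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.stray = []   # (tag, (line, column)) of unmatched closers

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop everything up to and including the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass
        else:
            self.stray.append((tag, self.getpos()))

checker = StrayTagChecker()
checker.feed("<td>Heading 3</i></td>")
print(checker.stray)  # flags the </i> that has no matching opener
```

This does not fix the parse, but it tells you the page is malformed before find() silently drops the rest of the table.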

I would appreciate any help with this issue.
kousu commented 4 years ago

I've run into this too:

import requests_html
s = requests_html.HTMLSession()

r = s.get('https://www.ewg.org/skindeep/browse/category/Around-eye_cream/')

# the obvious way to scrape fails, only giving the first link:
print(r.html.find('section.product-listings', first=True).absolute_links)
# >>> {'https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/'}

# the correct answer, found by searching for the <div>s directly, is 12 links:
print(len(set(list(a.absolute_links)[0] for a in r.html.find('div.product-tile > a'))))
# >>> 12

The trouble is that there's an extra </a> in each div.product-tile:

<section class="product-listings">
<div class="product-tile ">
<a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
<div class="product-image-wrapper flex">
<img class="product-image" src="https://static.ewg.org/skindeep_images/7689/768901.jpg" />
</div>
</a>
</a> <!-- <-- THIS ONE RIGHT HERE -->
<a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
<div class="product-score">
<img class="product-score-img verified" src="https://phorcys-static.ewg.org/skindeep_rails/score-verified-aac8fea9b2dfca2fe41036b016c3dc97d955ebca605a509f8272fc7d0e275e0f.svg" />
</div>
<hr>
<p class="product-company">Divine Woman</p>
<p class="product-name"> Revitalizing Eye Cream</p>
</a> </div>
<div class="product-tile ">
<a href="https://www.ewg.org/skindeep/products/695140-Parisians_Pure_Indulgence_Peptide_Eye_Gel/">
<div class="product-image-wrapper flex">
<img class="product-image" src="https://static.ewg.org/skindeep_images/6951/695140.jpg" />
</div>
</a>
</a>
....
</section>
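As a stopgap, the links can also be pulled out with an event-based parser that simply ignores unmatched closing tags. This is a sketch using the standard library's html.parser; LinkCollector and the example.test URLs are hypothetical stand-ins for the real markup:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags inside div.product-tile, shrugging off stray </a> tags."""
    def __init__(self):
        super().__init__()
        self.links = set()
        self.div_depth = 0      # current <div> nesting depth
        self.tile_depth = None  # depth at which the open product-tile started

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_depth += 1
            if self.tile_depth is None and "product-tile" in (attrs.get("class") or ""):
                self.tile_depth = self.div_depth
        elif tag == "a" and self.tile_depth is not None and attrs.get("href"):
            self.links.add(attrs["href"])

    def handle_endtag(self, tag):
        # unmatched </a> tags fall through here harmlessly
        if tag == "div":
            if self.tile_depth == self.div_depth:
                self.tile_depth = None
            self.div_depth -= 1

# simplified markup mirroring the page above, stray </a> tags included
sample = """
<section class="product-listings">
  <div class="product-tile ">
    <a href="https://example.test/product-1/"><div class="flex"></div></a>
    </a>
    <a href="https://example.test/product-1/"><p>Name</p></a>
  </div>
  <div class="product-tile ">
    <a href="https://example.test/product-2/"></a>
    </a>
  </div>
</section>
"""

collector = LinkCollector()
collector.feed(sample)
print(sorted(collector.links))  # both tiles' links survive the stray </a> tags
```

Because the collector never builds a tree, the bogus </a> tags cannot truncate anything.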

Firefox even recognizes them as stray and in error:

[Screenshot: Firefox developer tools flagging the stray </a> tags as errors]

But requests_html confuses them with the closing </section> and ends the section early:

print(r.html.find('section.product-listings', first=True).html)
# >>>
# <section class="product-listings">
# <div class="product-tile">
# <a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
# <div class="product-image-wrapper flex">
# <img class="product-image" src="https://static.ewg.org/skindeep_images/7689/768901.jpg"/>
# </div>
# </a>
# </div></section>

(For reproducibility, here's the exact HTML I'm trying to parse: ewg.html.txt)

BeautifulSoup has no problem with this:

import bs4
import requests
r = requests.get('https://www.ewg.org/skindeep/browse/category/Around-eye_cream/')
soup = bs4.BeautifulSoup(r.text, 'lxml')
print(set(e['href'] for e in soup.find('section', {'class': 'product-listings'})('a')))
# >>> {'https://www.ewg.org/skindeep/products/722318-isoi_Bulgarian_Rose_Intensive_Age_Control_Eye_Cream/', 'https://www.ewg.org/skindeep/products/800352-Vermont_Skincare_Company_Eye_Cream_7_E7/', 'https://www.ewg.org/skindeep/products/741269-AHC_The_Pure_Real_Eye_Cream_For_Face/', 'https://www.ewg.org/skindeep/products/742536-Farm_Grain_Organic_Super_Green_Eye_Essence/', 'https://www.ewg.org/skindeep/products/722332-isoi_Never_Drying_Ultimate_Eye_Cream/', 'https://www.ewg.org/skindeep/products/671522-Aromatica_Rose_Absolute_Eye_Cream/', 'https://www.ewg.org/skindeep/products/926934-For_The_Biome_Awaken_Eye_Serum/', 'https://www.ewg.org/skindeep/products/902303-Codex_Beauty_Eye_Gel_Cream/', 'https://www.ewg.org/skindeep/products/695140-Parisians_Pure_Indulgence_Peptide_Eye_Gel/', 'https://www.ewg.org/skindeep/products/889523-GINJO_Moisturizing_Eye_Cream/', 'https://www.ewg.org/skindeep/products/807064-Live_Ultimate_Eye_Luminate_MultiPeptide_AntiWrinkle_Eye_Cream/', 'https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/'}

# because the section isn't truncated:
print(soup.find('section', {'class': 'product-listings'}))
# >>>
# <section class="product-listings">
# <div class="product-tile">
# <a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
# <div class="product-image-wrapper flex">
# <img class="product-image" src="https://static.ewg.org/skindeep_images/7689/768901.jpg"/>
# </div>
# </a>
# <a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
# <div class="product-score">
# <img class="product-score-img verified" src="https://phorcys-static.ewg.org/skindeep_rails/score-verified-aac8fea9b2dfca2fe41036b016c3dc97d955ebca605a509f8272fc7d0e275e0f.svg"/>
# </div>
# <hr/>
# <p class="product-company">Divine Woman</p>
# <p class="product-name"> Revitalizing Eye Cream</p>
# </a> </div>
# <div class="product-tile">
# <a href="https://www.ewg.org/skindeep/products/695140-Parisians_Pure_Indulgence_Peptide_Eye_Gel/">
# <div class="product-image-wrapper flex">
# ...
# <hr/>
# <p class="product-company">GINJO</p>
# <p class="product-name"> Moisturizing Eye Cream</p>
# </a> </div>
# </section>

BeautifulSoup is using lxml and requests_html is using lxml, so how come the parses are different? Is there a way to tell requests_html to parse in quirks mode instead of strict mode? When screen-scraping, you can rarely trust pages to be 100% correctly formatted.
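For what it's worth, the strict-vs-lenient divergence can be reproduced with lxml alone: its XML parser rejects the stray tag outright, while its HTML parser recovers by silently discarding it. A sketch (assumes lxml is installed; how requests_html drives lxml internally is a separate question):

```python
import lxml.etree
import lxml.html

# the malformed snippet from the original report
broken = "<table><tr><td>Heading 3</i></td></tr></table>"

# the strict XML parser refuses the stray </i> outright:
try:
    lxml.etree.fromstring(broken)
except lxml.etree.XMLSyntaxError as err:
    print("XML parser:", err)

# the lenient HTML parser recovers, dropping the stray tag:
recovered = lxml.html.tostring(lxml.html.fromstring(broken))
print("HTML parser:", recovered)
```

If requests_html is taking the strict path somewhere, that would explain the truncation.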