Open 3VikramS opened 5 years ago
I've run into this too:
import requests_html
s = requests_html.HTMLSession()
r = s.get('https://www.ewg.org/skindeep/browse/category/Around-eye_cream/')
# the obvious way to scrape fails, only giving the first link:
print(r.html.find('section.product-listings', first=True).absolute_links)
# >>> {'https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/'}
# the correct answer, found by searching for the <div>s directly, is 12 links:
print(len(set(list(a.absolute_links)[0] for a in r.html.find('div.product-tile > a'))))
# >>> 12
The trouble is that there's an extra </a> in each div.product-tile:
<section class="product-listings">
<div class="product-tile ">
<a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
<div class="product-image-wrapper flex">
<img class="product-image" src="https://static.ewg.org/skindeep_images/7689/768901.jpg" />
</div>
</a>
</a> <!-- <-- THIS ONE RIGHT HERE -->
<a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
<div class="product-score">
<img class="product-score-img verified" src="https://phorcys-static.ewg.org/skindeep_rails/score-verified-aac8fea9b2dfca2fe41036b016c3dc97d955ebca605a509f8272fc7d0e275e0f.svg" />
</div>
<hr>
<p class="product-company">Divine Woman</p>
<p class="product-name"> Revitalizing Eye Cream</p>
</a> </div>
<div class="product-tile ">
<a href="https://www.ewg.org/skindeep/products/695140-Parisians_Pure_Indulgence_Peptide_Eye_Gel/">
<div class="product-image-wrapper flex">
<img class="product-image" src="https://static.ewg.org/skindeep_images/6951/695140.jpg" />
</div>
</a>
</a>
....
</section>
Firefox even recognizes them as stray closing tags and flags them as errors.
But requests_html gets confused by them and treats the stray </a> as if it were the closing </section>:
print(r.html.find('section.product-listings', first=True).html)
# >>>
# <section class="product-listings">
# <div class="product-tile">
# <a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
# <div class="product-image-wrapper flex">
# <img class="product-image" src="https://static.ewg.org/skindeep_images/7689/768901.jpg"/>
# </div>
# </a>
# </div></section>
(For reproducibility, here's the exact HTML I'm trying to parse: ewg.html.txt)
BeautifulSoup has no problem with this:
import bs4
import requests
r = requests.get('https://www.ewg.org/skindeep/browse/category/Around-eye_cream/')
soup = bs4.BeautifulSoup(r.text, 'lxml')  # name the parser explicitly so the comparison with requests_html is fair
print(set(e['href'] for e in soup.find('section', {'class': 'product-listings'})('a')))
# >>> {'https://www.ewg.org/skindeep/products/722318-isoi_Bulgarian_Rose_Intensive_Age_Control_Eye_Cream/', 'https://www.ewg.org/skindeep/products/800352-Vermont_Skincare_Company_Eye_Cream_7_E7/', 'https://www.ewg.org/skindeep/products/741269-AHC_The_Pure_Real_Eye_Cream_For_Face/', 'https://www.ewg.org/skindeep/products/742536-Farm_Grain_Organic_Super_Green_Eye_Essence/', 'https://www.ewg.org/skindeep/products/722332-isoi_Never_Drying_Ultimate_Eye_Cream/', 'https://www.ewg.org/skindeep/products/671522-Aromatica_Rose_Absolute_Eye_Cream/', 'https://www.ewg.org/skindeep/products/926934-For_The_Biome_Awaken_Eye_Serum/', 'https://www.ewg.org/skindeep/products/902303-Codex_Beauty_Eye_Gel_Cream/', 'https://www.ewg.org/skindeep/products/695140-Parisians_Pure_Indulgence_Peptide_Eye_Gel/', 'https://www.ewg.org/skindeep/products/889523-GINJO_Moisturizing_Eye_Cream/', 'https://www.ewg.org/skindeep/products/807064-Live_Ultimate_Eye_Luminate_MultiPeptide_AntiWrinkle_Eye_Cream/', 'https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/'}
# because the section isn't truncated:
print(soup.find('section', {'class': 'product-listings'}))
# >>>
# <section class="product-listings">
# <div class="product-tile">
# <a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
# <div class="product-image-wrapper flex">
# <img class="product-image" src="https://static.ewg.org/skindeep_images/7689/768901.jpg"/>
# </div>
# </a>
# <a href="https://www.ewg.org/skindeep/products/768901-Divine_Woman_Revitalizing_Eye_Cream/">
# <div class="product-score">
# <img class="product-score-img verified" src="https://phorcys-static.ewg.org/skindeep_rails/score-verified-aac8fea9b2dfca2fe41036b016c3dc97d955ebca605a509f8272fc7d0e275e0f.svg"/>
# </div>
# <hr/>
# <p class="product-company">Divine Woman</p>
# <p class="product-name"> Revitalizing Eye Cream</p>
# </a> </div>
# <div class="product-tile">
# <a href="https://www.ewg.org/skindeep/products/695140-Parisians_Pure_Indulgence_Peptide_Eye_Gel/">
# <div class="product-image-wrapper flex">
# ...
# <hr/>
# <p class="product-company">GINJO</p>
# <p class="product-name"> Moisturizing Eye Cream</p>
# </a> </div>
# </section>
BeautifulSoup is using lxml and requests_html is using lxml, so why do the two parses differ? Is there a way to tell requests_html to parse leniently (quirks mode) instead of in strict mode? When screen-scraping, you can rarely trust pages to be 100% correctly formatted.
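As a stopgap that doesn't depend on either library, the links can be collected with the standard library's html.parser, which reports every end tag as an event and leaves it to the caller to ignore the unmatched ones. This is only a minimal sketch: SectionLinkCollector and VOID_TAGS are hypothetical names (not part of requests_html or BeautifulSoup), the sample markup is a trimmed stand-in with example.com URLs rather than the real page, and it assumes the stray </a> never appears while another <a> is still open.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside <section class="...">.

    Hypothetical helper for illustration only.
    """

    def __init__(self, section_class="product-listings"):
        super().__init__()
        self.section_class = section_class
        self.in_section = False
        self.stack = []   # tags opened since (and including) the section
        self.links = []   # hrefs in document order, duplicates included

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_section:
            classes = (attrs.get("class") or "").split()
            if tag == "section" and self.section_class in classes:
                self.in_section = True
                self.stack = [tag]
            return
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if not self.in_section or tag not in self.stack:
            return  # stray end tag (such as the extra </a>): ignore it
        # Pop up to and including the matching open tag.
        while self.stack.pop() != tag:
            pass
        if not self.stack:
            self.in_section = False  # the section itself was just closed


# Trimmed stand-in for the page, with the same stray </a> pattern.
sample = """
<section class="product-listings">
<div class="product-tile ">
<a href="https://example.com/products/1/"><div><img src="1.jpg" /></div></a>
</a>
<a href="https://example.com/products/1/"><hr><p>Name</p></a> </div>
<div class="product-tile ">
<a href="https://example.com/products/2/"><div><img src="2.jpg" /></div></a>
</a>
</div>
</section>
"""

collector = SectionLinkCollector()
collector.feed(sample)
print(sorted(set(collector.links)))
```

On the real page this would be fed r.text from the requests call above. It only sidesteps the problem and doesn't explain why the two lxml-based parses differ.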