scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

Selector unable to find HTML body element #208

Closed: osjerick closed this issue 3 years ago

osjerick commented 3 years ago

I've been trying to parse this page: https://picsart.com/blog/post/life-hack-fake-golden-hour-photography-picsart. However, I've noticed that the selector attached to a Scrapy response cannot get the HTML body. This is how you can reproduce the issue using pure parsel + requests:

import requests
from parsel import Selector

r = requests.get('https://picsart.com/blog/post/life-hack-fake-golden-hour-photography-picsart')
s = Selector(text=r.text)
print(s.css('body')) # prints []
print(s.xpath('//body')) # prints []

After checking the selector text with s.get(), I noticed that it contains no body node.

I thought it was a problem with the response, but then I tried BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
print(soup.body) # prints the HTML body node

It also works when BeautifulSoup uses the lxml parser instead of html.parser. That is the strangest part, since parsel uses lxml too.

Also, the response has more than 1M lines. Could this size be related to the issue?

Is this an important issue, or am I missing some settings? It looks like an important issue to me.

I'm using parsel v1.6.0.

Gallaecio commented 3 years ago

Hmm, this may be a security feature rather than a bug: https://stackoverflow.com/a/33831595/939364 (the limit is 10MiB, and this response is 13.5MiB)
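If the size limit mentioned above is indeed the cause, one possible workaround is to parse the document with lxml's `huge_tree` option and hand the resulting tree to parsel through the `root` argument. The following is only a sketch under that assumption: the helper names `exceeds_assumed_limit` and `selector_with_huge_tree` and the constant `ASSUMED_LIMIT_BYTES` are hypothetical, and the 10MiB figure is taken from the comment above rather than verified against the parser.

```python
# Figure quoted in the comment above; treated here as an approximation of
# the parser's default text-size limit, not as an exact constant.
ASSUMED_LIMIT_BYTES = 10 * 1024 * 1024


def exceeds_assumed_limit(body: bytes) -> bool:
    """Return True if the raw response body is larger than the assumed limit."""
    return len(body) > ASSUMED_LIMIT_BYTES


def selector_with_huge_tree(html_text: str):
    """Build a parsel Selector from a tree parsed with lxml's huge_tree option."""
    # Third-party imports are kept local so the size check above
    # can be used without lxml or parsel installed.
    from lxml import etree
    from parsel import Selector

    # huge_tree=True lifts the parser's built-in size limits.
    parser = etree.HTMLParser(recover=True, encoding="utf8", huge_tree=True)
    root = etree.fromstring(html_text.encode("utf8"), parser=parser)
    # Wrap the already-parsed tree instead of letting parsel parse the text.
    return Selector(root=root)
```

With the reproduction script from the first comment, this could be tried as `exceeds_assumed_limit(r.content)` followed by `selector_with_huge_tree(r.text).css('body')`. Because `huge_tree` disables a safety limit, it is probably best reserved for sources you trust.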

Gallaecio commented 3 years ago

Closing in favor of https://github.com/scrapy/parsel/issues/110