scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

There seems to be an error in the parsing of the xpath and css #268

Closed by rubbberrabbit 1 year ago

rubbberrabbit commented 1 year ago

I am using parsel.Selector to process my HTML file, but the result is unexpected, so I went through the Parsel API documentation to check whether I am misusing Parsel or XPath. However, I get a completely different result even with the first example there, and I think this is why I get unexpected results when handling my HTML. The documentation I am referring to is https://parsel.readthedocs.io/en/latest/usage.html

from parsel import Selector
text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
selector = Selector(text=text)
print(selector.css('h1'))

The documentation shows that the expected result is

[<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]

but I get

[<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1></body></html>\n'>]

The Selector returns the <h1> element together with everything after it, instead of only the <h1> element. My Python version is 3.9.12 and my Parsel version is 1.6.0.

Gallaecio commented 1 year ago

Please, use the latest version of Parsel (1.7.0 at the moment) to report issues.

Gallaecio commented 1 year ago

In any case, I cannot reproduce this issue either way with your Python version and Parsel 1.6 or 1.7:

$ python --version
Python 3.9.12
$ cat test.py 
from parsel import Selector
text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
selector = Selector(text=text)
print(selector.css('h1'))
$ pip install parsel==1.6.0
[…]
$ python test.py 
[<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
$ pip install parsel==1.7.0
[…]
$ python test.py 
[<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]

Maybe it is your lxml version that causes the issue?

In general, it is advisable to try upgrading all your dependencies and see if the issue still reproduces, to make sure it has not already been fixed in a new version of some of the dependencies.
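
To compare environments when an issue like this cannot be reproduced, it may help to include the installed versions of Parsel and its dependencies in the report. The following is only a minimal sketch using the standard library: the helper names installed_versions and libxml2_version are made up for illustration, and the package list (parsel, lxml, cssselect, w3lib) is an assumption about which dependencies are relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


def libxml2_version():
    """Return the libxml2 version lxml runs against, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    # LIBXML_VERSION is a tuple of integers; join it into a dotted string.
    return ".".join(str(part) for part in etree.LIBXML_VERSION)


if __name__ == "__main__":
    # Hypothetical package list; adjust it to the dependencies you care about.
    packages = ["parsel", "lxml", "cssselect", "w3lib"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver or 'not installed'}")
    print(f"libxml2: {libxml2_version() or 'unknown'}")
```

Running this in both the failing and the working environment and comparing the output should show whether the two differ in lxml or libxml2, which is the kind of difference suggested above as a possible cause.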