scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

XPath union operator does not nest correctly after .css() #205

Closed jokull closed 3 years ago

jokull commented 3 years ago
In [86]: selector = Selector(text="""
    ...: <div>
    ...:   <p>
    ...:     A
    ...:     <br>
    ...:     B
    ...:   </p>
    ...: </div>  
    ...: <div>
    ...:   <p>C</p>
    ...: </div>
    ...: """)
    ...: 
    ...: selector.css("div:nth-child(1)").xpath("(//p/text() | //p/br)")
Out[86]: 
[<Selector xpath='(//p/text() | //p/br)' data='\n    A\n    '>,
 <Selector xpath='(//p/text() | //p/br)' data='<br>'>,
 <Selector xpath='(//p/text() | //p/br)' data='\n    B\n  '>,
 <Selector xpath='(//p/text() | //p/br)' data='C'>]

Before I dig into a pull request, and because I’m not an XPath expert, can someone confirm this looks wrong?

Shouldn’t the CSS selector scope the subsequent XPath expression, so that C is not included in the output?

Gallaecio commented 3 years ago

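Try .// instead of //. A leading // always searches from the document root, no matter which selector you call it on, while .// is relative to the current node. It’s an extremely common pitfall with XPath.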

jokull commented 3 years ago

THANK YOU!