scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

Make sel.xpath('.') work the same for text elements #130

Open Gallaecio opened 5 years ago

Gallaecio commented 5 years ago

Given:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
...         <body>
...             <h1>Hello, Parsel!</h1>
...         </body>
...         </html>""")

For text, you get:

>>> subsel = sel.css('h1::text')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1/text()' data=u'Hello, Parsel!'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[]

However, regular elements work as you would expect:

>>> subsel = sel.css('h1')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1' data=u'<h1>Hello, Parsel!</h1>'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[<Selector xpath='.' data=u'<h1>Hello, Parsel!</h1>'>]

I believe text nodes should behave the same way: '.' should select the current node even when that node is a text node.

redapple commented 5 years ago

Hey @Gallaecio, I'd like to see this too. I believe the limitation comes from lxml rather than from libxml2 (or parsel): lxml text results do not accept further XPath calls. You can only call .getparent() on the "smart string" results, and note that "smart_strings" is disabled by default in parsel. libxml2, on the other hand, allows XPath operations on text nodes:

>>> import libxml2
>>> doc = libxml2.htmlParseDoc('''<html>
... <head>
... <meta charset="UTF-8">
... <title>Title of the document</title>
... </head>
... 
... <body>
... Content of the document......
... </body>
... 
... </html>''', 'ascii')
>>> doc
<xmlDoc (None) object at 0x7ff070272680>
>>> ctxt = doc.xpathNewContext()
>>> res = ctxt.xpathEval("//text()")
>>> res
[<xmlNode (text) object at 0x7ff0702a2560>, <xmlNode (text) object at 0x7ff071d95320>]
>>> res[0].get_content()
'Title of the document'
>>> for t in res:
...     print(t.xpathEval("parent::*"))
... 
[<xmlNode (title) object at 0x7ff07025e7e8>]
[<xmlNode (body) object at 0x7ff07025e878>]

If you know Cython, adding support for this to lxml could be a nice contribution.
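For comparison with the libxml2 session above, here is a rough sketch of the lxml side. It assumes lxml's default behaviour of returning "smart strings" from `text()` queries when called directly (outside parsel); the variable names are only illustrative.

```python
from lxml import etree

root = etree.fromstring("<html><body><h1>Hello, Parsel!</h1></body></html>")

# text() results are string subclasses ("smart strings"), not nodes.
text = root.xpath("//h1/text()")[0]

# There is no xpath() method to call on the result...
has_xpath = hasattr(text, "xpath")

# ...but the smart string still remembers the element it came from.
parent_tag = text.getparent().tag

# Any further XPath has to be evaluated relative to that parent element.
via_parent = text.getparent().xpath("text()")

print(has_xpath, parent_tag, via_parent)
```

With smart strings disabled, as in parsel, even `getparent()` is unavailable, so there is no node left to evaluate '.' against.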

redapple commented 5 years ago

Related: https://bugs.launchpad.net/lxml/+bug/996134