rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Text nodes not displayed with `deep=True` #61

Closed HugoLaurencon closed 2 years ago

HugoLaurencon commented 2 years ago

Hello, I am testing this toy example:

from selectolax.parser import HTMLParser

html_str = """
<html>
<body>
<div>this is a test
    <h1>Heading</h1>
</div>
</body>
</html>
"""

selectolax_tree = HTMLParser(html_str)
for node in selectolax_tree.root.traverse(include_text=True):
    print(f"Node tag: {node.tag}")
    if node.tag == "-text":
        print(f"Node text: {node.text(deep=True)}")
    print("-------")

which outputs:

Node tag: html
-------
Node tag: head
-------
Node tag: body
-------
Node tag: -text
Node text: 
-------
Node tag: div
-------
Node tag: -text
Node text: 
-------
Node tag: h1
-------
Node tag: -text
Node text: Heading
-------
Node tag: -text
Node text: 

-------
Node tag: -text
Node text: 

-------

so the text node `this is a test` is not displayed (its `Node text:` line, right after the `div` tag, is empty).

If I call `node.text(deep=False)` instead, the text `this is a test` is displayed.

The problem also disappears if I remove the `h1` tag: the text `this is a test` is then displayed with either `deep=True` or `deep=False`.

Any idea why?

rushter commented 2 years ago

I've pushed a fix, but it needs more tests since it can alter behaviour for other use cases. Basically, deep extraction was not working when all of the following were true: 1) we start from a text node, 2) that node has no child node, and 3) there is a next node (the `h1` tag in your case).
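To make the three conditions concrete, here is a minimal toy model of the behaviour the issue expects. It is only a sketch: `ToyNode` and `deep_text` are hypothetical names invented for illustration, and this is not selectolax's actual implementation or the code of the fix. The assumption is that deep extraction should collect text from the starting node and its descendants, and never from its siblings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a parsed node (not selectolax's Node class)."""
    tag: str
    value: str = ""  # only meaningful for "-text" nodes
    children: List["ToyNode"] = field(default_factory=list)


def deep_text(node: ToyNode) -> str:
    """Return the text of the node itself and of all its descendants."""
    if node.tag == "-text":
        # A text node has no children, so its own value is the whole answer,
        # regardless of whether a sibling follows it.
        return node.value
    return "".join(deep_text(child) for child in node.children)


# Mirror the reported case: inside <div>, a text node is followed by <h1>.
text_node = ToyNode("-text", "this is a test\n    ")
h1 = ToyNode("h1", children=[ToyNode("-text", "Heading")])
div = ToyNode("div", children=[text_node, h1, ToyNode("-text", "\n")])

# Starting from the text node (no child, but a next sibling) still yields its text.
print(repr(deep_text(text_node)))
# Starting from the div yields the text of every descendant, in document order.
print(repr(deep_text(div)))
```

In this model the presence of the `h1` sibling has no effect on what the text node returns, which matches what the reporter expected from `deep=True`.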

rushter commented 2 years ago

I've made a new release.