rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

text() on tree object not passing down "strip" parameter #35

Closed phoerious closed 3 years ago

phoerious commented 3 years ago

When I call text(strip=True) on the root tree object, the strip parameter is not being passed on to the body tag object. Here's the code in parser.pyx:

    def text(self, bool deep=True, str separator='', bool strip=False):
        return self.body.text(deep=deep, separator=separator, strip=False)

So tree.text(strip=True) isn't working, but an explicit tree.body.text(strip=True) is.
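A minimal sketch of the likely fix, using pure-Python stand-ins rather than the real Cython classes. `FakeNode`, `FakeTree`, `text_buggy` and `text_fixed` are hypothetical names for illustration; only the way `strip` is forwarded mirrors the snippet above.

```python
class FakeNode:
    """Hypothetical stand-in for the body node; holds raw text chunks."""

    def __init__(self, chunks):
        self.chunks = chunks

    def text(self, deep=True, separator='', strip=False):
        # `deep` is accepted only to mirror the signature; this stand-in is flat.
        parts = [c.strip() if strip else c for c in self.chunks]
        return separator.join(parts)


class FakeTree:
    """Hypothetical stand-in for the parser object that wraps a body node."""

    def __init__(self, body):
        self.body = body

    def text_buggy(self, deep=True, separator='', strip=False):
        # Mirrors the reported behaviour: the caller's `strip` is ignored.
        return self.body.text(deep=deep, separator=separator, strip=False)

    def text_fixed(self, deep=True, separator='', strip=False):
        # Forward the caller's value instead of a hard-coded False.
        return self.body.text(deep=deep, separator=separator, strip=strip)


tree = FakeTree(FakeNode(["  hello  ", "  world  "]))
buggy = tree.text_buggy(strip=True)   # whitespace is kept
fixed = tree.text_fixed(strip=True)   # same as tree.body.text(strip=True)
```

With the forwarded argument, calling the method on the tree and on its body gives the same result.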

Moreover, the whole behaviour of this parameter is somewhat wonky and unpredictable depending on where white space appears.

HTMLParser("<body><p>sfsdf\n\n\n\n          xxxx\n\n\n\n\n\n\n</p>aaa</body>\n\n\n\n").body.text(strip=True)

gives

'sfsdf\n\n\n\n          xxxxaaa'

where only trailing white space is clipped, whereas

HTMLParser("<body><p>sfsdf\n\n\n\n          \n\n\n\n\n\n\n</p>aaa</body>\n\n\n\n").body.text(strip=True)

gives

'sfsdfaaa'

which is more like what I'd expect, but it strips the whitespace completely instead of collapsing it to a single space.
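For comparison, collapsing runs of whitespace to a single space can be done on the caller's side with plain string methods. This is only a sketch of the behaviour I had in mind, not something the library does; `collapse_whitespace` is a hypothetical helper, and `raw` is a hand-written string shaped like the unstripped text of the first example.

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace, so joining with a single space
    # collapses every run to one space.
    return " ".join(text.split())


# Hand-written sample, assumed to resemble the unstripped text content.
raw = "sfsdf\n\n\n\n          xxxx\n\n\n\n\n\n\naaa\n\n\n\n"
collapsed = collapse_whitespace(raw)
```

Here `collapsed` keeps one space between the words instead of gluing them together.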

rushter commented 3 years ago

> Moreover, the whole behaviour of this parameter is somewhat wonky and unpredictable depending on where white space appears.

It behaves like Python's str.strip, but applied to each node separately. This avoids extra spaces in many common cases.
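A rough pure-Python model of that per-node behaviour, which reproduces both outputs from the report. The split of each input into text chunks is an assumption about how the parser sees the markup (the `<p>` text, then `aaa` together with the trailing newlines), and `join_nodes` is a hypothetical helper, not part of the library.

```python
def join_nodes(chunks, separator='', strip=False):
    # Apply str.strip() to each chunk on its own, then join the results.
    # Whitespace inside a chunk is untouched; only its two ends are trimmed.
    parts = [c.strip() if strip else c for c in chunks]
    return separator.join(parts)


gap = " " * 10  # the run of spaces from the examples in the report

# Assumed text chunks for the two inputs in the report.
first = ["sfsdf\n\n\n\n" + gap + "xxxx\n\n\n\n\n\n\n", "aaa\n\n\n\n"]
second = ["sfsdf\n\n\n\n" + gap + "\n\n\n\n\n\n\n", "aaa\n\n\n\n"]

first_result = join_nodes(first, strip=True)
second_result = join_nodes(second, strip=True)
with_separator = join_nodes(second, separator=' ', strip=True)
```

In the first input the whitespace sits between `sfsdf` and `xxxx`, inside one chunk, so it survives. In the second it is at the end of the chunk, so it is trimmed. Passing a separator together with `strip=True` puts a single space between the trimmed chunks in this model.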