scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

.remove() also removes text after the deleted element #206

Closed Scarfmonster closed 2 years ago

Scarfmonster commented 3 years ago

I tried removing an element as a way to exclude some repeated text from a website. I used the following code:

import parsel

html = """
<html><body>
Text before.
<span>Text in.</span>
Text after.
</body></html>
"""

s = parsel.Selector(html)
s.css('span').remove()

print(s.get())

This results in:

<html><body>
Text before.
</body></html>

I would expect only the span to be removed and the text after it to be left as-is. Instead, .remove() also deletes the text that follows the element, up to the next element or the end of the removed element's parent.

Gallaecio commented 3 years ago

I can confirm this issue in Parsel 1.6.0.

Scarfmonster commented 3 years ago

Apparently this is a byproduct of how lxml stores text: the text that follows an element is stored as part of that element (its tail), so removing the element also removes the text. I tried mitigating this in PR #207.
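To illustrate the tail behaviour, here is a minimal sketch of one possible workaround. It is not necessarily what the PR does. It assumes the standard library's xml.etree.ElementTree as a stand-in, since it shares the .text/.tail model with lxml, and the helper name remove_keeping_tail is made up for this example.

```python
import xml.etree.ElementTree as ET


def remove_keeping_tail(parent, child):
    """Remove child from parent but keep the text that follows it.

    Hypothetical helper for illustration; not part of parsel or lxml.
    """
    children = list(parent)
    index = children.index(child)
    tail = child.tail or ""
    if index > 0:
        # Attach the trailing text to the previous sibling's tail.
        previous = children[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        # No previous sibling: the text belongs to the parent's leading text.
        parent.text = (parent.text or "") + tail
    parent.remove(child)


root = ET.fromstring(
    "<html><body>Text before.<span>Text in.</span>Text after.</body></html>"
)
body = root.find("body")
span = body.find("span")

# The text after the span is stored on the span itself.
print(repr(span.tail))

remove_keeping_tail(body, span)
result = ET.tostring(root, encoding="unicode")
print(result)
```

The helper copies the tail onto the previous sibling (or onto the parent's text when there is no previous sibling) before removing the element, so the surrounding text stays in the tree.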