scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License
1.15k stars 146 forks source link

Don't remove text after deleted element #247

Closed Laerte closed 2 years ago

Laerte commented 2 years ago

Fixes #206, closes #207.

Use approach proposed in #215

codecov[bot] commented 2 years ago

Codecov Report

Merging #247 (6c10cd3) into master (96fc3d7) will not change coverage. The diff coverage is n/a.

@@            Coverage Diff            @@
##            master      #247   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          138       138           
  Branches        29        29           
=========================================
  Hits           138       138           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

kmike commented 2 years ago

Thanks @Laerte!

wRAR commented 2 years ago

We have a new probem now: as drop_tree() is only defined in HTMLElement, drop() doesn't work with the XML parser (so Selector(type='xml'), it raises CannotDropElementWithoutParent because AttributeError is raised from self.root.drop_tree().

Laerte commented 2 years ago

@wRAR Can you give an example so i can take a look?

wRAR commented 2 years ago
from parsel import Selector
sel = Selector(text="<a><b></b><c/></a>", type='xml')
el = sel.xpath('//b')[0]
assert el.root.getparent()
el.drop()
Laerte commented 2 years ago
diff --git a/parsel/selector.py b/parsel/selector.py
index b84b030..89942f0 100644
--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -550,7 +550,7 @@ class Selector:
         Drop matched nodes from the parent element.
         """
         try:
-            self.root.getparent()
+            parent = self.root.getparent()
         except AttributeError:
             # 'str' object has no attribute 'getparent'
             raise CannotRemoveElementWithoutRoot(
@@ -561,7 +561,10 @@ class Selector:
             )

         try:
-            self.root.drop_tree()
+            if self.type == "xml":
+                parent.remove(self.root)
+            else:
+                self.root.drop_tree()
         except (AttributeError, AssertionError):
             # 'NoneType' object has no attribute 'drop'
             raise CannotDropElementWithoutParent(

What you think? I tested locally and it work for you example and all others tests pass.

wRAR commented 2 years ago

Makes sense to me. I also feel like this needs new tests, as the problem wasn't caught by existing ones.

@Gallaecio you had some worries about supporting non-lxml parsers, do you have anything to add now that we have this new info that drop_tree() is even more specific?

Laerte commented 2 years ago

I can use your snippet to create one, but of course if we have more i can add too.

wRAR commented 2 years ago

Oh, I think it's enough to just run it on an XML parser.

Gallaecio commented 2 years ago

@Gallaecio you had some worries about supporting non-lxml parsers, do you have anything to add now that we have this new info that drop_tree() is even more specific?

No. It was just a matter of naming, functionally-wise there are no special considerations to have I think (for different types we will need different code, there is no other way about it).