scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License
1.13k stars 144 forks source link

Fix the issue where HTML elements cannot be dropped from the text selector returned by Selector.jmespath() #298

Open dream2333 opened 3 months ago

dream2333 commented 3 months ago

When using the .xpath method to create nodes from a text type selector, it appears that these nodes are actually copies generated from the text, rather than being generated based on the original root node. As a result, when executing the .drop method, it doesn't affect the content of the original HTML tree. This issue is mostly observed when using jmespath and xpath in combination.

body_selector = response.jmespath("news.body")
styles = body_selector.xpath("//style")
styles.drop()
# always contains the content of the style tags
content = body_selector.xpath("string(.)").get()

This pull request fixes an issue where HTML elements were not being dropped correctly from a text selector. The ·_text_lazy_html_root· attribute has been added to store a temporary root node, which prevents the creation of a new root HTMLElement copy each time a text selector is used

Fixes https://github.com/scrapy/parsel/issues/297, resolves #299

dream2333 commented 3 months ago

297

Gallaecio commented 3 months ago

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

dream2333 commented 3 months ago

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

Sure, I'll add a test for this issue after work. My main concern is that direct modifications to the selector might break the forward compatibility of the entire library. In the current version, the implementation of converting text to HtmlElement is not very elegant either. It doesn't cache the HtmlElement generated from text, nor does it associate the HtmlElement with the selector itself. This leads to the creation of a new copy every time a query is made on a text node. Any modifications made to this copy will not affect the original Selector at all.

dream2333 commented 3 months ago

@Gallaecio Updated pull request. Added two test cases to verify the effect of dropping nodes from a Selector of type 'text'.

I tried to remove type="text", but it broke jmespath. Currently, I think adding a cache for root's HTML is the least destructive solution and it solves the problem of recreating etree._Element every time an xpath query is executed on a selector of type 'text'.

Gallaecio commented 3 months ago

So, I have created #299 as an alternative, but I have no strong opinion on which way to go. To be honest, I think both could work, i.e. this approach makes things work by default for HTML (which is also assumed to be default when no type is specified), while #299 would provide a way to make things work with XML by specifying type="xml" manually.

@kmike @wRAR Any thoughts?