Open dream2333 opened 1 month ago
Could someone help me understand why this is happening?
I've figured out why this is happening. If you call .drop() on nodes selected from a Selector that was created from JSON in Scrapy, the underlying DOM is not modified. However, if you extract the HTML text from the JSON and build a new Selector from it, the issue does not occur. This seems to be a bug in Parsel's Selector implementation.
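from scrapy.selector import Selector

# Extract the HTML string from the JSON response, then parse it as HTML explicitly
content = response.jmespath("news.body").get()
selector = Selector(text=content, type="html")
# drop() now removes the nodes from the tree that `selector` holds
selector.xpath(".//script|.//style").drop()
item.content = selector.xpath("string(.)").get().strip()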
When .xpath is called on a text-type selector, the resulting nodes appear to be copies parsed from the text rather than nodes belonging to the original root. As a result, calling .drop on them does not affect the original HTML tree. This happens mostly when jmespath and xpath are used in combination.
This is quite subtle. To make .drop effective, we first need to call .xpath(".") to generate a new HtmlSelector; only then does .drop work as expected on it. This behavior is not intuitive and can lead to unexpected results. I believe it would be beneficial to either change this behavior or clarify it in the documentation.
selector = json_selector.jmespath("news.body").xpath(".")
selector.xpath(".//script|.//style").drop()
item.content = selector.xpath("string(.)").get().strip()
Refs #298
Hello
I'm trying to remove the 'style' tag from the element using selector.xpath(".//script|.//style").drop(). However, even after executing this line of code, the 'style' element still exists in the DOM.
Here's the URL: https://newsinfo.eastmoney.com/kuaixun/v2/api/content/getnews?newsid=202406083099747443&newstype=1