scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

SelectorList.drop() removing elements doesn't work as expected #297

Open dream2333 opened 1 month ago

dream2333 commented 1 month ago
def parse_detail(self, response: HtmlResponse, item: DetailDataItem):
    selectors = response.jmespath("news.body")
    selectors.xpath(".//script|.//style").drop()
    item.content = selectors.xpath("string(.)").get().strip()
    yield item

I'm trying to remove the script and style elements from the selected content using selectors.xpath(".//script|.//style").drop(). However, even after this line runs, the style element still exists in the DOM, so its text ends up in item.content.

[Image: WeChat screenshot 20240609010908]

Here's url: https://newsinfo.eastmoney.com/kuaixun/v2/api/content/getnews?newsid=202406083099747443&newstype=1

dream2333 commented 1 month ago

Could someone help me understand why this is happening?

dream2333 commented 1 month ago

I've figured out why this is happening. If you call drop() on a Selector that was created from JSON in Scrapy, the drop does not take effect on the HTML that is extracted afterwards. However, if you extract the HTML text from the JSON and build a new Selector from it, the issue does not occur. This seems to be a bug in Parsel's Selector implementation.

content = response.jmespath("news.body").get()
selector = Selector(text=content, type="html")
selector.xpath(".//script|.//style").drop()
item.content = selector.xpath("string(.)").get().strip()

dream2333 commented 1 month ago

When .xpath() is called on a text-type selector, the resulting nodes appear to be built from a fresh parse of the text rather than from the original root node. As a result, calling .drop() on them does not affect the original HTML tree. This mostly happens when jmespath and xpath are used in combination.
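To illustrate the mechanism described above, here is a toy model that uses only the standard library's xml.etree.ElementTree. It is not Parsel's code: TextBacked, drop and text_of are hypothetical names made up for this sketch, and the markup is a made-up sample. It only shows why removing nodes from a tree that was parsed on the fly leaves the stored text untouched.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup standing in for the HTML string found in the JSON.
MARKUP = "<div><style>p{color:red}</style><p>Hello</p></div>"


class TextBacked:
    """Hypothetical stand-in for a selector that stores only a string."""

    def __init__(self, text):
        self.text = text

    def find_all(self, tag):
        # Every query parses the stored string again, so the returned
        # elements belong to a brand-new tree.
        tree = ET.fromstring(self.text)
        return tree, tree.findall(f".//{tag}")


def drop(tree, elements):
    """Remove each element from its parent within the given tree."""
    parents = {child: parent for parent in tree.iter() for child in parent}
    for element in elements:
        parents[element].remove(element)


def text_of(tree):
    """Concatenate all text nodes below the tree's root."""
    return "".join(tree.itertext())


# Text-backed: the drop happens on a throwaway tree.
node = TextBacked(MARKUP)
throwaway_tree, styles = node.find_all("style")
drop(throwaway_tree, styles)
reparsed_tree, _ = node.find_all("p")  # parses the original string again
text_after_text_backed_drop = text_of(reparsed_tree)

# Tree-backed: parse once, then query and drop on that same tree.
shared_tree = ET.fromstring(MARKUP)
drop(shared_tree, shared_tree.findall(".//style"))
text_after_tree_backed_drop = text_of(shared_tree)

print(text_after_text_backed_drop)   # still contains the style text
print(text_after_tree_backed_drop)   # style text is gone
```

In the text-backed case the second query never sees the modified tree, which matches the symptom reported in this issue.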

This is quite subtle. To make .drop() effective, we first need to call .xpath(".") to obtain a new HTML selector; only then does .drop() work as expected on it. This behavior is not intuitive and can lead to confusion or unexpected results. I believe it would be beneficial to either change this behavior or document it clearly.

# Re-root the text selector as an HTML selector so later queries share one tree
selectors = json_selector.jmespath("news.body").xpath(".")
selectors.xpath(".//script|.//style").drop()
item.content = selectors.xpath("string(.)").get().strip()

dream2333 commented 1 month ago

Refs #298

Tanjir369 commented 2 weeks ago

Hello