scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

XPath query is buggy #296

Closed: shner-elmo closed 4 months ago

shner-elmo commented 6 months ago

Hey, I'm trying to locate a table inside the HTML using an XPath query, and it's not working as expected: when I select the first element with [1], it returns a list of two elements instead of just one. (I tested it in Chrome and it works correctly there.)

This is the code that I used to initialize it:

import parsel

html = '....'
sel = parsel.Selector(html)

And the bug: image

Gallaecio commented 6 months ago

What you do on Chrome does not matter, because Chrome does not work on the raw HTML response, but on the DOM.

I bet there are 2 tables that are the first element of their parent. (//table)[1] probably does what you want.

shner-elmo commented 6 months ago

> What you do on Chrome does not matter, because Chrome does not work on the raw HTML response, but on the DOM.

I don't understand. How is the DOM different from the HTML? Is it because some JS may have modified it?

If that's the case, it makes no difference here, because the HTML that I opened in Chrome was a local file (file://...) that I saved from a website.

kmike commented 6 months ago

@shner-elmo there are some more caveats, even unrelated to JS; see https://docs.scrapy.org/en/latest/topics/developer-tools.html#caveats-with-inspecting-the-live-browser-dom