Closed patrikpihlstrom closed 4 years ago
Will this do? @minas90 https://github.com/taoqf/node-html-parser/commit/881d708e72383908753af57417910a2a7a339711
@taoqf Here is an example page
Try running `document.querySelectorAll('p,h1')` in Chrome's console. It returns `NodeList(16) [h1, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p]`. Because `h1` appears before the `p`s in the page, the result should follow the order of elements in the page, not the order of selectors in the query.
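To make the expected ordering concrete, here is a minimal sketch, not node-html-parser's actual code. The `SimpleNode` shape, the `el` helper, and `queryAllInDocumentOrder` are hypothetical names made up for illustration, and only plain tag selectors are handled. The idea is to walk the tree once and test each node against every selector in the list, instead of running each selector separately and concatenating the results.

```typescript
// Hypothetical minimal node shape for illustration only;
// this is NOT node-html-parser's real class hierarchy.
interface SimpleNode {
  tagName: string;
  children: SimpleNode[];
}

// Collect every descendant whose tag matches ANY name in a comma-separated
// selector list. A single depth-first, pre-order walk yields document order,
// regardless of the order the selectors were written in.
// Assumption: selectors are plain tag names (no classes, ids, combinators).
function queryAllInDocumentOrder(root: SimpleNode, selectorList: string): SimpleNode[] {
  const wanted = new Set(
    selectorList
      .split(',')
      .map((s) => s.trim().toLowerCase())
      .filter((s) => s.length > 0)
  );
  const matches: SimpleNode[] = [];
  const visit = (node: SimpleNode): void => {
    for (const child of node.children) {
      if (wanted.has(child.tagName.toLowerCase())) {
        matches.push(child);
      }
      visit(child); // descend before moving on to the next sibling
    }
  };
  visit(root);
  return matches;
}

// Small helper to build a test tree.
const el = (tagName: string, children: SimpleNode[] = []): SimpleNode => ({ tagName, children });

// <body><h1/><div><p/><p/></div><p/></body>
const page = el('body', [el('h1'), el('div', [el('p'), el('p')]), el('p')]);

const tags = queryAllInDocumentOrder(page, 'p,h1').map((n) => n.tagName);
console.log(tags); // [ 'h1', 'p', 'p', 'p' ]
```

Running each selector on its own and concatenating would give `p, p, p, h1` for the same tree, which is the behaviour reported above.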
Yes, you are right. But it is not easy to resolve this. I added some tests, but I cannot find the right answer for them. https://github.com/taoqf/node-html-parser/commit/a655a9a17ff031ec7c4a204e78a63743e4644e4c#diff-8e5f1331d13915fbd871f0a650422099R348 https://github.com/taoqf/node-html-parser/commit/a655a9a17ff031ec7c4a204e78a63743e4644e4c#diff-8e5f1331d13915fbd871f0a650422099R356
I see. I will take a look tomorrow and try to resolve it myself.
I parse a few million HTML pages every day. Right now I'm using JSDOM to get complete DOM functionality and correct behaviour, but it's painfully slow. `node-html-parser` is ~80 times faster on average for my use case, but it is missing a lot of functionality. So I'm considering forking the original repo from `ashi009` and implementing all the missing functionality.
Your fork has the highest number of commits, so it will be very helpful in the process. Thanks a lot for that!
That would be very helpful. Please let me know if there is any progress.
Good job!