taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.12k stars 112 forks

support ',' in query selectors #12

Closed patrikpihlstrom closed 4 years ago

taoqf commented 4 years ago

Good job!

taoqf commented 4 years ago

Will this do? @minas90 https://github.com/taoqf/node-html-parser/commit/881d708e72383908753af57417910a2a7a339711

minas90 commented 4 years ago

@taoqf Here is an example page. Try running document.querySelectorAll('p,h1') in Chrome's console. It returns NodeList(16) [h1, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p]. The h1 comes first because it appears before the p elements in the page, so the result should follow the order of elements in the page rather than the order of selectors in the query.
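To make the expected ordering concrete, here is a minimal sketch of one way to get document order for a comma-separated selector list: walk the tree once in pre-order and keep any node that matches at least one of the selectors. This is not node-html-parser's implementation. The node shape ({ tagName, childNodes }), the el helper, and the function name querySelectorAllInDocumentOrder are all hypothetical, and only bare tag selectors are handled.

```javascript
// Hypothetical minimal node shape for illustration only:
// { tagName: string, childNodes: node[] }. Not node-html-parser's real classes.
function el(tagName, ...childNodes) {
  return { tagName, childNodes };
}

// Sketch: supports only a comma-separated list of bare tag names, e.g. 'p,h1'.
function querySelectorAllInDocumentOrder(root, selectorList) {
  const wanted = new Set(
    selectorList
      .split(',')
      .map((s) => s.trim().toLowerCase())
      .filter(Boolean)
  );
  const matches = [];
  // An explicit stack gives a pre-order depth-first walk, which is document order.
  // The root itself is skipped, since querySelectorAll only returns descendants.
  const stack = [...root.childNodes].reverse();
  while (stack.length > 0) {
    const node = stack.pop();
    if (wanted.has(node.tagName.toLowerCase())) {
      matches.push(node);
    }
    // Push children right-to-left so the leftmost child is visited first.
    for (let i = node.childNodes.length - 1; i >= 0; i--) {
      stack.push(node.childNodes[i]);
    }
  }
  return matches;
}

const doc = el('body', el('h1'), el('div', el('p'), el('p')), el('p'));
const tags = querySelectorAllInDocumentOrder(doc, 'p,h1').map((n) => n.tagName);
console.log(tags); // [ 'h1', 'p', 'p', 'p' ]
```

Because each node is tested against the whole selector list during a single walk, the result follows page order regardless of how the selectors are ordered in the query, and a node matching several selectors is returned only once.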

taoqf commented 4 years ago

Yes, you are right, but this is not easy to resolve. I added some tests, but I have not found the right answer for them yet. https://github.com/taoqf/node-html-parser/commit/a655a9a17ff031ec7c4a204e78a63743e4644e4c#diff-8e5f1331d13915fbd871f0a650422099R348 https://github.com/taoqf/node-html-parser/commit/a655a9a17ff031ec7c4a204e78a63743e4644e4c#diff-8e5f1331d13915fbd871f0a650422099R356

minas90 commented 4 years ago

I see. I will take a look tomorrow and try to resolve it myself. I parse a few million HTML pages every day. Right now I'm using JSDOM to get complete DOM functionality and correct behaviour, but it's painfully slow. node-html-parser is ~80 times faster on average for my use case, but it is missing a lot of functionality. So I'm considering forking the original repo from ashi009 and implementing all the missing functionality. Your fork has the most commits, so it will be very helpful in the process. Thanks a lot for that!

taoqf commented 4 years ago

That would be very helpful. Please let me know if there is any progress.