Open peterbe opened 5 years ago
I don't think it matters for the MDN scraper script, but it might matter for the build-json script once we have loooots of snippets of HTML that need to be parsed.
Another big difference between the two is that JSDOM is a lot less forgiving than Cheerio. Cheerio uses the htmlparser2 parser, which is very forgiving. Not sure if that matters until you actually have a problem, though.
Bleh! Given how little we use an HTML parser, it might never make sense to make it faster. Apart from having to rewrite the code, we might go from a total of 100ms to 50ms. :)
For the JSON build, I'm not sure we'll need JSDOM (or something like it) at all. We need to replace marked with unified/remark (see https://github.com/mdn/sprints/issues/1505 - a PR is forthcoming to document the decision), at which point we can slice up the Markdown before generating the HTML.
For scraping, I don't really care, though I prefer the DOM idiom to jQuery, mainly because I'd much rather read MDN's DOM-related docs than anyone else's.
I'd prefer to use DOM APIs rather than jQuery; to me this is the big advantage of JSDOM.
It seems unclear what the exact performance impact is either in general or in this particular project.
I'm tempted to just close this.
What alerted me were topics (i.e. performance, memory leaks, forgiveness) that might not be relevant to us or help us. We can just make a mental note that Cheerio exists as an option if any of the above-mentioned topics surface.
Why did no one benchmark this? So much discussion, but no one invested time in benchmarking?
About a decade ago it was shown that Cheerio is 8x faster than JSDOM.
Much has been said here about that claim and they acknowledge that the original (Russian) benchmark is outdated and JSDOM has changed significantly since then. But someone in Feb 2019 backed up the benchmark with a speed difference of 3.8x.
Not that it matters without a reproducible benchmark (and an apples to apples comparison), but, FYI, that comment doesn't "back up" the original benchmark. It claims jsdom is faster. (It also says by 3.8 seconds, whatever that means, rather than 3.8x.)
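For anyone who wants numbers for this project rather than for a 2012 blog post, here is a minimal sketch of a timing harness using only Node's built-in `perf_hooks`. The names `timeParser`, `sampleHtml` and `countTags` are made up for illustration, and the regex "parser" is just a placeholder so the file runs with nothing installed. It is not a rigorous benchmark (no GC control, no memory measurement), just a starting point for an apples-to-apples comparison where both libraries do the same work on the same input.

```javascript
// Minimal timing harness (a sketch, not a rigorous benchmark).
const { performance } = require('node:perf_hooks');

// Runs `parse(html)` `iterations` times after `warmup` untimed runs
// and returns simple timing statistics in milliseconds.
function timeParser(name, parse, html, { iterations = 200, warmup = 20 } = {}) {
  // Warm-up runs give the JIT a chance to optimize before we measure.
  for (let i = 0; i < warmup; i++) parse(html);

  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    parse(html);
    samples.push(performance.now() - start);
  }

  samples.sort((a, b) => a - b);
  const total = samples.reduce((sum, ms) => sum + ms, 0);
  return {
    name,
    iterations,
    meanMs: total / iterations,
    medianMs: samples[Math.floor(iterations / 2)],
    minMs: samples[0],
    maxMs: samples[iterations - 1],
  };
}

// Hypothetical input: a repeated snippet standing in for real MDN HTML.
const sampleHtml =
  '<ul>' + '<li class="item"><a href="#">x</a></li>'.repeat(500) + '</ul>';

// Placeholder "parser" so this runs without any dependencies.
const countTags = (html) => (html.match(/<li/g) || []).length;

console.log(timeParser('regex-baseline', countTags, sampleHtml));

// With `cheerio` and `jsdom` installed, the same query could be timed as:
//   const cheerio = require('cheerio');
//   const { JSDOM } = require('jsdom');
//   timeParser('cheerio', (h) => cheerio.load(h)('li.item').length, sampleHtml);
//   timeParser('jsdom', (h) =>
//     new JSDOM(h).window.document.querySelectorAll('li.item').length, sampleHtml);
```

Swapping `sampleHtml` for the HTML the scraper and build-json scripts actually process would matter more than the harness itself, since results on synthetic input may not carry over.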
About a decade ago it was shown that Cheerio is 8x faster than JSDOM. See https://habr.com/en/post/163979/ from 2012. It's a bit "hard to read" since it's written in Russian, but it's easy to find the raw benchmark results (focusing on NodeJS). Much has been said at https://github.com/cheeriojs/cheerio/issues/700 about that claim, and they acknowledge that the original (Russian) benchmark is outdated and JSDOM has changed significantly since then. But someone in Feb 2019 backed up the benchmark with a speed difference of 3.8x.
Here's a good article outlining the difference between the two candidates.
Another thing on my mind was that I heard about the Mozilla Activity Stream folks who worked on parsing web pages in Node for the sake of suggestions in about:newtab. They didn't mention speed but severe memory bloat in JSDOM. It might not matter in a CLI because even if it bloats a bit, it's not a daemon.
On a personal note, I really like/prefer the API of Cheerio, but perhaps that's just years and years of using jQuery in browser JS and PyQuery in Python code.