mdn / stumptown-content


Cheerio instead of JSDOM #60

Open peterbe opened 5 years ago

peterbe commented 5 years ago

About a decade ago it was shown that Cheerio is 8x faster than JSDOM. See https://habr.com/en/post/163979/ from 2012. It's a bit "hard to read" since it's written in Russian, but it's easy to find the raw benchmark results (focusing on NodeJS). Much has been said here https://github.com/cheeriojs/cheerio/issues/700 about that claim and they acknowledge that the original (Russian) benchmark is outdated and that JSDOM has changed significantly since then. But someone in Feb 2019 backed up the benchmark with a speed difference of 3.8x.

Here's a good article outlining the difference between the two candidates.

Another thing on my mind was that I heard about the Mozilla Activity Stream folks who worked on parsing web pages in Node for the sake of suggestions in about:newtab. They didn't mention speed but severe memory bloat in JSDOM. It might not matter in a CLI because even if it bloats a bit, it's not a daemon.
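To put a number on the memory concern rather than rely on hearsay, something like the following rough probe could be used. It is only a sketch: `heapGrowth` and `fakeParse` are hypothetical names invented for illustration, and the placeholder parser stands in for whichever library is being tested. Heap readings are noisy, so the result is an indication at best, and it can even be negative if the garbage collector runs mid-measurement.

```javascript
// Rough heap-growth probe using only Node built-ins.
// `parse` is a hypothetical stand-in for whichever parser is under test.
function heapGrowth(parse, html, runs = 100) {
  // global.gc only exists when Node is started with --expose-gc.
  if (typeof global.gc === 'function') global.gc();
  const before = process.memoryUsage().heapUsed;

  // Keep every result alive so we measure retained memory, not garbage.
  const retained = [];
  for (let i = 0; i < runs; i++) retained.push(parse(html));

  const after = process.memoryUsage().heapUsed;
  const totalBytes = after - before;
  return { runs: retained.length, totalBytes, bytesPerDocument: totalBytes / runs };
}

// Placeholder parser so the sketch runs without any dependencies.
const fakeParse = (html) => html.split('<').map((chunk) => ({ chunk }));
const html = '<p>hello</p>'.repeat(500);
console.log(heapGrowth(fakeParse, html));
```

Running it with `node --expose-gc` gives a cleaner baseline. For a short-lived CLI the per-document figure matters less than it would for a long-running process.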

On a personal note, I really like/prefer the API of Cheerio but perhaps that's just years and years of using jQuery in browser JS and PyQuery in Python code.

peterbe commented 5 years ago

I don't think it matters for the mdn scraper script but it might matter for the build-json script once we have loooots of snippets of HTML that need to be parsed.

peterbe commented 5 years ago

Another big difference between the two is that JSDOM is a lot less forgiving than Cheerio. Cheerio uses the htmlparser2 parser, which is very forgiving. Not sure if that matters until you actually have a problem though.

peterbe commented 5 years ago

Bleh! Given how little we use an HTML parser, it might never make sense to make it faster. Apart from having to rewrite the code, we might only go from a total of 100ms to 50ms. :)

ddbeck commented 5 years ago

For the JSON build, I'm not sure we'll need JSDOM (or something like it) at all. We need to replace marked with unified/remark (see https://github.com/mdn/sprints/issues/1505; a PR is forthcoming to document the decision), at which point we can slice up the Markdown before generating the HTML.

For scraping, I don't really care, though I prefer the DOM idiom to jQuery, mainly because I'd much rather read MDN's DOM-related docs than anyone else's. 😆

wbamberg commented 5 years ago

I'd prefer to use DOM APIs rather than jQuery; to me this is the big advantage of JSDOM.

It seems unclear what the exact performance impact is either in general or in this particular project.

peterbe commented 5 years ago

I'm tempted to just close this.

What prompted me were topics (i.e. performance, memory leaks, forgiveness) that might not be relevant or help us. We can just make a mental note that Cheerio exists as an option if any of the above-mentioned topics surface.

frank-dspeed commented 4 years ago

Why did no one benchmark this? So much discussion, but no one invested time in benchmarking?

chocolateboy commented 4 years ago

> About a decade ago it was shown that Cheerio is 8x faster than JSDOM.
>
> Much has been said here about that claim and they acknowledge that the original (Russian) benchmark is outdated and that JSDOM has changed significantly since then. But someone in Feb 2019 backed up the benchmark with a speed difference of 3.8x.

Not that it matters without a reproducible benchmark (and an apples to apples comparison), but, FYI, that comment doesn't "back up" the original benchmark. It claims jsdom is faster. (It also says by 3.8 seconds, whatever that means, rather than 3.8x.)
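For what a reproducible, apples-to-apples comparison might look like, here is a minimal sketch of a timing harness using only Node's standard library. The names `medianParseTime`, `compareParsers`, `candidates` and `sampleHtml` are made up for illustration, and the two stand-in "parsers" are trivial string functions so the sketch runs without installing anything. It is not a measurement of Cheerio or JSDOM.

```javascript
// Minimal timing harness using only Node's standard library.
const { performance } = require('node:perf_hooks');

// Run `parse(html)` repeatedly and return the median wall-clock time in ms.
function medianParseTime(parse, html, { warmup = 5, runs = 30 } = {}) {
  // Warm-up runs give the JIT a chance to optimise before we measure.
  for (let i = 0; i < warmup; i++) parse(html);

  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    parse(html);
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2;
}

// Time every candidate on the same input and sort fastest first.
function compareParsers(candidates, html, options) {
  return Object.entries(candidates)
    .map(([name, parse]) => ({ name, medianMs: medianParseTime(parse, html, options) }))
    .sort((a, b) => a.medianMs - b.medianMs);
}

// Hypothetical stand-ins; swap in the real libraries to compare them.
const candidates = {
  countTags: (html) => (html.match(/<[a-z][^>]*>/gi) || []).length,
  splitOnTags: (html) => html.split('<').length - 1,
};
const sampleHtml = '<ul>' + '<li><a href="#">item</a></li>'.repeat(1000) + '</ul>';
console.table(compareParsers(candidates, sampleHtml));
```

To compare the actual libraries, the stand-ins could be replaced with something like `(html) => cheerio.load(html)` and `(html) => new JSDOM(html)`, ideally followed by the same query in each, so both candidates do equivalent work on the same input. Using HTML snippets from the project itself would speak to the question of impact in this particular project.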