maxlath / wikibase-dump-filter

Filter and format a newline-delimited JSON stream of Wikibase entities

Support for multicore systems #32

Open ghost opened 3 years ago

ghost commented 3 years ago

I run the filter on 2 ramdisks, each around 100 GB, to speed up processing. Still, my 32-core machine idles at around 5% and will take 12-16 hours to filter all entries (0.5 ms average time per entry).

As I don't know Node.js very well, I'm not sure I can add multithreading to this myself, but Node.js can definitely spawn child workers - is there an easy 2-3 line addition to spawn more of them? See https://nodejs.org/docs/latest/api/cluster.html
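For illustration, here is a minimal sketch of what a cluster-based approach could look like. It is not part of wikibase-dump-filter, and the "keep items only" condition is just a placeholder for a real filter:

```js
// Sketch only: fan newline-delimited JSON out to one worker process per CPU
// with Node's cluster module. Back-pressure handling is omitted.
const cluster = require('node:cluster')
const readline = require('node:readline')
const os = require('node:os')

if (cluster.isPrimary) {
  const workers = Array.from({ length: os.cpus().length }, () => cluster.fork())
  let next = 0
  const rl = readline.createInterface({ input: process.stdin })
  rl.on('line', line => {
    // Round-robin the raw lines to the workers
    workers[next].send(line)
    next = (next + 1) % workers.length
  })
  rl.on('close', () => workers.forEach(w => w.disconnect()))
} else {
  process.on('message', line => {
    try {
      // Dump lines end with a comma; strip it before parsing
      const entity = JSON.parse(line.replace(/,$/, ''))
      // Placeholder filter: keep items (Q-ids) only
      if (entity.id && entity.id.startsWith('Q')) {
        process.stdout.write(JSON.stringify(entity) + '\n')
      }
    } catch (err) {
      // The first and last dump lines are just "[" and "]": ignore them
    }
  })
}
```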

I'm using server boards, but I guess lots of people doing this will be on a Ryzen system or similar.

Multicore unpacking of the archive is doable with 'pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 | wikibase-dump-filter', which shows node at exactly 100% and the unzipping at ~110%, so node is still the bottleneck. This halves the average to 0.25 ms for me, but with "just" 64 GB RAM on a rented machine with lots of cores you could get filter time down to under 30 minutes with multicore processing, greatly reducing the cost of weekly updates.
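One way to use those extra cores on the filtering side (a sketch, not something from the wikibase-dump-filter docs) is to let GNU parallel split the decompressed stream across several filter processes; the claim P31:Q5 below is only a placeholder:

```sh
# Sketch: split the decompressed ndjson stream into blocks and feed each
# block to its own wikibase-dump-filter process.
pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 |
  parallel --pipe --block 50M 'wikibase-dump-filter --claim P31:Q5' \
  > filtered.ndjson
```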

Thanks for your great work, really sparing me days of processing, R

maxlath commented 3 years ago

It could be possible to use threads, but I haven't explored that option yet. I have explored the multi-process option though, see the documentation on parallelization. Note that wikibase-dump-filter will always be the bottleneck because of the operations on JSON (parsing and stringifying), so it's worth pre-filtering out any line that can be dropped cheaply, see pre-filtering.
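A pre-filtering sketch, assuming you only want entities with a P31 claim (see the project's pre-filtering documentation for the recommended pattern; the grep pattern and claim here are illustrative):

```sh
# Sketch: cheaply drop lines that cannot match before paying the JSON
# parsing/stringifying cost; only lines mentioning P31 reach the filter.
pbzip2 -d -c latest-all.json.bz2 |
  grep '"P31"' |
  wikibase-dump-filter --claim P31:Q5 > filtered.ndjson
```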

ghost commented 3 years ago

Thanks for pointing that out, but that actually makes it somewhat slower on my machine, and the CLI output gets weird.

I'm glad I have something to get simple Q-P-Q data, maybe with labels, for meta-analysis, so I'm fine, but I'm really surprised that your software is pretty much the only one out there doing this task.