jlnr opened this issue 8 years ago
Update: This is now possible because I've replaced the Ragel script with a small C++ tool that can process its input as a stream.
It is also slower by a factor of 10, taking 33 minutes instead of 3 to process the zhwiki dump. If the enwiki script can still finish overnight (<8h), that's good enough.
Everything works in memory now; you just need to set EN_DATE/ZH_DATE. The final step would be to determine the latest dump date automatically via the JSON status files (e.g. https://dumps.wikimedia.org/enwiki/20170501/dumpstatus.json), as sketched below.
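
Something like this could work for picking the date. It probes each dump's dumpstatus.json (newest first) until it finds one whose multistream articles job has finished; the job name and field layout (`jobs.articlesmultistreamdump.status`) are assumptions based on the current status file format, and it needs curl and jq:

```sh
# Probe enwiki dump dates (newest first) until one reports the multistream
# articles job as "done", then use that date for the download URL.
# Assumed dumpstatus.json layout: { "jobs": { "articlesmultistreamdump": { "status": ... } } }
for date in $(curl -s https://dumps.wikimedia.org/enwiki/ | grep -oE '[0-9]{8}' | sort -ru); do
  status=$(curl -s "https://dumps.wikimedia.org/enwiki/$date/dumpstatus.json" \
             | jq -r '.jobs.articlesmultistreamdump.status' 2>/dev/null)
  if [ "$status" = "done" ]; then
    EN_DATE=$date
    break
  fi
done
echo "EN_DATE=$EN_DATE"
```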
Instead of downloading the full Wikipedia dump, extracting it, and then running a Ragel script over the XML file, can we just do everything in memory? Pseudocode:
curl -s http://dumps.wikimedia.org/.../enwiki-20170220-pages-articles-multistream.xml.bz2 | bzcat | ./extract-movies enwiki
Rationale: Having 100 GB of free space is a rare occurrence for me.
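
With the date resolved as above, the concrete pipeline would look roughly like this (the URL pattern is inferred from the example filename, so treat it as an assumption); nothing is written to disk besides whatever ./extract-movies itself produces:

```sh
# Stream the bz2 dump straight from dumps.wikimedia.org into the extractor,
# using the EN_DATE determined from the JSON status files above.
curl -s "https://dumps.wikimedia.org/enwiki/$EN_DATE/enwiki-$EN_DATE-pages-articles-multistream.xml.bz2" \
  | bzcat \
  | ./extract-movies enwiki
```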