lurado / MovieDict

iOS dictionary for international movie titles & Wikipedia mining tools
https://moviedict.info

Rewrite database importer to work in memory #14

Open jlnr opened 8 years ago

jlnr commented 8 years ago

Instead of downloading the full Wikipedia dump, extracting it, and then running a Ragel script over the XML file, can we just do it all in memory? Pseudocode: `curl -s http://dumps.wikimedia.org/.../enwiki-20170220-pages-articles-multistream.xml.bz2 | bzcat | ./extract-movies enwiki`

Rationale: Having 100 GB of free space is a rare occurrence for me.
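
For reference, a minimal sketch of what that streaming pipeline could look like as a script. The dump URL pattern and the `./extract-movies` invocation come from the pseudocode above, but the `WIKI`/`DATE` variables are placeholders rather than the importer's actual interface:

```sh
#!/bin/sh
# Stream a Wikipedia dump through the extractor without touching the disk.
# WIKI and DATE are placeholders; adjust them to the dump you want to process.
set -eu

WIKI="${WIKI:-enwiki}"
DATE="${DATE:-20170220}"
URL="https://dumps.wikimedia.org/${WIKI}/${DATE}/${WIKI}-${DATE}-pages-articles-multistream.xml.bz2"

# curl streams the compressed dump, bzcat decompresses on the fly,
# and the extractor reads the XML from stdin.
curl -fsSL "$URL" | bzcat | ./extract-movies "$WIKI"
```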

jlnr commented 7 years ago

Update: This is now possible because I've replaced the Ragel script with a small C++ tool that can stream its input.

It is also slower by a factor of 10, taking 33 minutes instead of 3 to process the zhwiki dump. If the enwiki script can finish overnight (<8h), that's still good enough.

jlnr commented 7 years ago

It all works in memory now; you just need to set EN_DATE/ZH_DATE. The final step would be to automatically determine the latest dump date from the JSON status files (https://dumps.wikimedia.org/enwiki/20170501/dumpstatus.json etc.).
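
A rough shell sketch of how that lookup might work, assuming `jq` is available and that `dumpstatus.json` exposes a `jobs.articlesmultistreamdump.status` field that reads `"done"` once the multistream articles dump has finished — the job name and the candidate-date handling are assumptions, not the importer's actual interface. The result could then be assigned to `EN_DATE`/`ZH_DATE`:

```sh
#!/bin/sh
# Usage: ./latest-dump-date.sh <wiki> <candidate date> [<candidate date> ...]
# e.g.:  EN_DATE=$(./latest-dump-date.sh enwiki 20170520 20170501)
set -eu

if [ $# -lt 2 ]; then
  echo "usage: $0 <wiki> <date> [<date> ...]" >&2
  exit 2
fi

WIKI="$1"
shift

# A dump date counts as usable once its multistream articles job is done.
# NOTE: the .jobs.articlesmultistreamdump.status path is an assumption about
# the dumpstatus.json layout and should be checked against the real file.
dump_is_done() {
  curl -fsSL "https://dumps.wikimedia.org/${WIKI}/$1/dumpstatus.json" \
    | jq -e '.jobs.articlesmultistreamdump.status == "done"' >/dev/null
}

# Print the first finished dump date (pass candidates newest-first).
for candidate in "$@"; do
  if dump_is_done "$candidate"; then
    echo "$candidate"
    exit 0
  fi
done

echo "no finished dump found among candidates" >&2
exit 1
```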