tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Update XML dump file namespace version #288

Closed xxyzz closed 4 months ago

xxyzz commented 4 months ago

Dump files from 20240601 onward use version 0.11: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038392

Please note that with this change, older dump files will no longer be extracted. I have checked the en, zh and de editions, and none of these dump files contain empty pages.

xxyzz commented 4 months ago

The bz2 file at https://github.com/tatuylonen/wiktextract/blob/master/tests/test-pages-articles.xml.bz2 also needs to be updated after this PR is merged.

kristian-clausal commented 4 months ago

If the only change we make here is updating the namespace string, it feels like we shouldn't break older dump files. Is it possible to dynamically determine whether the dump file is 0.10 or 0.11 and pick between them in decompress_dump_file?
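
A rough sketch of what that detection could look like, not the code from this PR. It uses the standard library's `xml.etree.ElementTree` instead of lxml to stay self-contained. The helper names `detect_export_namespace` and `iter_page_titles` are hypothetical, and the two namespace URIs assume the usual `http://www.mediawiki.org/xml/export-<version>/` form. The idea is to read the namespace from the root element and build the tag names from it, rather than hard-coding one version.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the two schema versions discussed in this thread.
KNOWN_NAMESPACES = (
    "http://www.mediawiki.org/xml/export-0.10/",
    "http://www.mediawiki.org/xml/export-0.11/",
)


def detect_export_namespace(stream):
    """Return the namespace URI of the root element (hypothetical helper)."""
    for _event, elem in ET.iterparse(stream, events=("start",)):
        # The first "start" event is always the root <mediawiki> element.
        tag = elem.tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        if namespace not in KNOWN_NAMESPACES:
            raise ValueError(f"unsupported dump namespace: {namespace!r}")
        return namespace
    raise ValueError("empty XML stream")


def iter_page_titles(stream):
    """Yield page titles from a seekable dump stream of either version."""
    namespace = detect_export_namespace(stream)
    stream.seek(0)  # rewind; the stream must be seekable
    page_tag = f"{{{namespace}}}page"
    title_tag = f"{{{namespace}}}title"
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == page_tag:
            yield elem.findtext(title_tag)
            elem.clear()  # drop the page's children to limit memory use
```

For a real dump, the stream could be the file object returned by `bz2.open(path, "rb")`, which is seekable, although rewinding a compressed file means decompressing it again from the start.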

xxyzz commented 4 months ago

I'll check how to use multiple xml namespaces in lxml's functions.

xxyzz commented 4 months ago

The code works fine on the new 20240601 zh edition dump file and is ready to be merged.

kristian-clausal commented 4 months ago

Thank you! If the dump file works on your side, I'll switch back to using -latest for kaikki.org.

xxyzz commented 4 months ago

I didn't extract all pages; I only checked whether there are any empty pages, and I think the Wikimedia developers have fixed the empty-page bug.
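
For reference, a check along these lines can be sketched as below. This is an illustration, not the script actually used. It assumes an "empty page" means a `<page>` whose revision `<text>` is missing or whitespace-only; the thread does not spell that out. `find_empty_pages` is a hypothetical name. Tags are matched by local name, so the same code works for both 0.10 and 0.11 dumps.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_empty_pages(stream):
    """Yield titles of pages whose <text> is missing or blank (assumed definition)."""
    title = None
    has_text = False
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            if elem.text and elem.text.strip():
                has_text = True
        elif name == "page":
            if not has_text:
                yield title
            # Reset per-page state and free the finished page's children.
            title, has_text = None, False
            elem.clear()
```

Passing the file object from `bz2.open(path, "rb")` streams the compressed dump without extracting it to disk first.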

xxyzz commented 4 months ago

I notice that all the 20240601 dump files are larger than the 0501 files. en: 1.1G -> 1.3G, fr: 588.7M -> 669.7M. These are compressed .bz2 files, so the extracted files will be larger still. I hope the server has enough disk space...

kristian-clausal commented 4 months ago

The 0501 files were the corrupted ones, so we're returning to the state that was in April, so it should (the most dangerous word) be fine.

xxyzz commented 4 months ago

The 0520 files are the corrupted ones and have been removed from dumps.wikimedia.org; the 0501 files are fine.