The bz2 file at https://github.com/tatuylonen/wiktextract/blob/master/tests/test-pages-articles.xml.bz2 also needs to be updated after this PR is merged.
If the only change we're making here is updating the namespace string, it feels like we shouldn't break older dump files. Would it be possible to dynamically determine whether the dump file is 0.10 or 0.11 and pick between them in decompress_dump_file?
I'll check how to handle multiple XML namespaces with lxml's functions.
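For reference, one way this could work (a minimal sketch, not wiktextract's actual code; the integration with decompress_dump_file would differ) is to peek at the root element's namespace with lxml before building any tag names:

```python
import bz2
from lxml import etree

def detect_export_namespace(path: str) -> str:
    """Peek at a .xml.bz2 dump's root element and return its namespace
    URI, e.g. "http://www.mediawiki.org/xml/export-0.11/"."""
    with bz2.open(path, "rb") as f:
        # The "start" event fires on the root element first, so we can
        # stop after one iteration without decompressing the whole file.
        for _, elem in etree.iterparse(f, events=("start",)):
            return etree.QName(elem.tag).namespace
    raise ValueError(f"no XML content found in {path}")
```

The detected URI could then be interpolated into the page-tag lookups, so both 0.10 and 0.11 dumps would parse with the same code.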
The code works fine on the new 20240601 zh edition dump file and is ready to be merged.
Thank you! If the dump file works on your side, I'll switch back to using -latest for kaikki.org.
I didn't extract all pages; I only checked whether there are any empty pages, and I think the Wikimedia developers have fixed the empty-page bug.
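Roughly this kind of spot check, for the record (a sketch with hypothetical names, not the actual verification script; it reuses the namespace-detection idea above):

```python
import bz2
from lxml import etree

def find_empty_pages(path: str, ns: str) -> list[str]:
    # Hypothetical spot check: stream a .xml.bz2 dump and collect the
    # titles of pages whose <text> element is missing or empty.
    empty: list[str] = []
    with bz2.open(path, "rb") as f:
        for _, page in etree.iterparse(f, tag=f"{{{ns}}}page"):
            text = page.find(f"./{{{ns}}}revision/{{{ns}}}text")
            if text is None or not (text.text or "").strip():
                empty.append(page.findtext(f"./{{{ns}}}title", ""))
            page.clear()  # release parsed elements to keep memory flat
    return empty
```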
I noticed that all 20240601 dump files have grown compared to the 0501 files: en: 1.1G -> 1.3G, fr: 588.7M -> 669.7M. And since these are compressed .bz2 files, the extracted files will be even larger. I hope the server has enough disk space...
The 0501 files were the corrupted ones, so we're returning to the state we were in back in April; it should (the most dangerous word) be fine.
The 0520 files are corrupted and have been removed from dumps.wikimedia.org; the 0501 files are fine.
New dump files starting from 20240601 use schema version 0.11: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038392
Please note that with this change, older dump files will not be extracted. I have checked the en, zh, and de editions, and none of these dump files have empty pages.
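For context on why older dumps stop working: lxml matches fully qualified tags, so lookups pinned to the 0.11 namespace silently find nothing in a 0.10 dump rather than raising an error. A toy illustration (the namespace URIs follow the standard MediaWiki export pattern):

```python
from lxml import etree

NS_0_10 = "http://www.mediawiki.org/xml/export-0.10/"
NS_0_11 = "http://www.mediawiki.org/xml/export-0.11/"

# A toy 0.10-style document: the namespace URI embeds the schema version.
doc = etree.fromstring(
    f'<mediawiki xmlns="{NS_0_10}"><page><title>A</title></page></mediawiki>'
)

# A parser pinned to 0.11 matches nothing in a 0.10 dump.
print(len(doc.findall(f"{{{NS_0_11}}}page")))  # 0
print(len(doc.findall(f"{{{NS_0_10}}}page")))  # 1
```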