peterjc / mediawiki_to_git_md

Convert a MediaWiki export XML file into MarkDown as a series of git commits
MIT License

RuntimeError: Proxy error(ArgumentOutOfRangeException): Year, Month, and Day parameters describe an un-representable DateTime. #32

Open 459737087 opened 10 months ago

459737087 commented 10 months ago

why?

peterjc commented 10 months ago

My guess is an invalid date (e.g. month and day mixed up from US style somewhere, or something strange like 29 February in a non-leap year).

I would start by adding some exception handling to print out some debug information.
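A minimal sketch of that kind of exception handling, assuming the failure comes from parsing a revision timestamp in the usual MediaWiki export form (e.g. `2023-09-20T12:34:56Z`). The thread has not yet established where the error is actually raised, and `parse_timestamp` and its arguments are hypothetical names, not functions from the script:

```python
from datetime import datetime


def parse_timestamp(text, title="?", revision_id="?"):
    """Parse a MediaWiki export timestamp, reporting context on failure."""
    try:
        return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError as err:
        # Hypothetical debug output: name the page and revision with the bad date.
        raise ValueError(
            f"Bad timestamp {text!r} in page {title!r}, revision {revision_id}: {err}"
        ) from err


print(parse_timestamp("2023-09-20T12:34:56Z"))
try:
    # 29 February in a non-leap year, one of the guesses above.
    parse_timestamp("2023-02-29T00:00:00Z", title="Example", revision_id=42)
except ValueError as err:
    print(err)
```

Re-raising with the page title and revision ID attached should make it possible to find the offending entry in the dump without sharing the whole file.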

Are you willing and able to share the Wiki dump with me by email (assuming it is not overly large)?

459737087 commented 10 months ago

https://dumps.wikimedia.org/zhwiki/20230920/#:~:text=zhwiki%2D20230920%2Dpages%2Darticles.xml.bz2

you can try this one. @peterjc

peterjc commented 10 months ago

Larger than I was expecting, assuming this is the file you meant: zhwiki-20230920-pages-articles.xml.bz2 (2.5 GB).

I need to have a clean out - this machine's drive is fuller than I thought!

peterjc commented 10 months ago

Do you have the full traceback error still? I wanted to check where in the code this RuntimeError was triggered.

[The size of the Chinese wiki example makes testing this harder]

peterjc commented 10 months ago

This script is not really suitable for a wiki dump this big! I killed it after 30 minutes, by which point Python was apparently using 18 GB of RAM and had only recorded 1.8 million revisions in SQLite (taking 3.8 GB).

Update: The file has over 4 million revisions, so I got less than halfway:

$ cat zhwiki-20230920-pages-articles.xml.bz2 | bzip2 -d | grep "<revision>" -c
4339799
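The same count could be done in Python without decompressing to disk. This is a rough sketch using only the standard library; `count_revisions` is a hypothetical helper, not part of the script, and it counts `<revision>` elements rather than matching lines as `grep -c` does:

```python
import bz2
import xml.etree.ElementTree as ET


def count_revisions(path):
    """Count <revision> elements in a bz2-compressed MediaWiki XML dump."""
    count = 0
    with bz2.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            # Tags carry the export namespace, e.g. "{http://...}revision".
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "revision":
                count += 1
            elif tag == "page":
                # Drop the finished page's children to keep memory use low.
                elem.clear()
    return count
```

Usage would be along the lines of `count_revisions("zhwiki-20230920-pages-articles.xml.bz2")`.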

[I'm trying this on Python 3 with some modifications; I assume you are using it on Python 2 - see issue #33]

peterjc commented 10 months ago

Switched from macOS to Linux: over 3 million revisions were parsed in ~15 minutes, but then it hit a 32 GB memory limit. It occurs to me that the SQLite database currently has no indexing - I'd never pushed the script to such a large example.
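For illustration, adding an index in SQLite from Python looks roughly like this. The table name `revisions`, its columns, and the index name are assumptions made for the sketch, not the script's real schema, and nothing in this thread shows that an index would fix the memory limit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real script would use a file on disk
# Hypothetical schema: the script's actual table and column names may differ.
conn.execute("CREATE TABLE revisions (title TEXT, date TEXT, content TEXT)")

# Index the column used for sorting; this may let SQLite walk the index
# in order instead of sorting every row at query time.
conn.execute("CREATE INDEX idx_revisions_date ON revisions (date)")

# List the indexes on the table (column 1 of each row is the index name).
indexes = [row[1] for row in conn.execute("PRAGMA index_list(revisions)")]
print(indexes)
```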

459737087 commented 10 months ago

I use Ubuntu 20.04 and Python 3.8.

mathieujobin commented 10 months ago

I'm curious whether it would be possible to migrate straight from MySQL to Markdown/Git without the SQLite intermediate DB?

peterjc commented 10 months ago

@mathieujobin Currently the script goes from the XML dump to SQLite, then to MediaWiki files on disk, then to Markdown files on disk, which get tracked in git.

The SQLite intermediate is there to sort the changes so that the git log is chronological. Looking back, the earlier version checked in had this, even before it dealt with uploaded files, so perhaps the XML is sorted by page first?
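A small sketch of that sorting step, assuming the XML really is grouped by page as guessed above. The data and the `revisions` schema are made up for illustration and are not taken from the script:

```python
import sqlite3

# Revisions as they might appear in the XML: grouped by page (illustrative data).
revisions_in_xml_order = [
    ("Page A", "2010-01-01T00:00:00Z"),
    ("Page A", "2012-06-01T00:00:00Z"),
    ("Page B", "2011-03-15T00:00:00Z"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (title TEXT, date TEXT)")  # hypothetical schema
conn.executemany("INSERT INTO revisions VALUES (?, ?)", revisions_in_xml_order)

# Timestamps in this fixed ISO 8601 form sort correctly as plain text.
commit_order = conn.execute(
    "SELECT title, date FROM revisions ORDER BY date"
).fetchall()
for title, date in commit_order:
    print(date, title)
```

Here the Page B revision lands between the two Page A revisions, which is the interleaving a chronological git log needs.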