Closed stappersg closed 3 months ago
See 54634c086bc794832506d03146dfee44ee981150
Script xml_to_git.py takes an XML dump file, mediawiki_to_md.py takes individual mediawiki files.
Also no need to decompress: the extensions .gz (commonly used) and .bz2 (typical from Wikipedia exports) are handled automatically.
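For anyone curious how extension-based decompression can work, here is a minimal sketch. It is not the code from xml_to_git.py; the helper name open_dump is hypothetical, and it only assumes the standard library gzip and bz2 modules.

```python
import bz2
import gzip


def open_dump(path):
    """Open a dump for reading bytes, choosing a decompressor by extension.

    Hypothetical helper for illustration only; the real script may
    detect compression differently.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    # Anything else is treated as an uncompressed file.
    return open(path, "rb")
```

With something like this, a call such as open_dump("obf_mediawiki_dump_2024-02-05.xml.bz2") returns a file-like object that an XML parser can read directly, so no decompressed copy is needed on disk.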
Seems to behave as expected now on macOS as of v2.0.2:
$ git init testing
$ cd testing/
$ ln -s ~/Downloads/obf_mediawiki_dump_2024-02-05.xml.bz2
$ time ../xml_to_git.py -i obf_mediawiki_dump_2024-02-05.xml.bz2
WARNING - running without username to GitHub mapping
WARNING - running without username ignore list
Creating SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
============================================================
Parsing XML and saving revisions by page.
Finished parsing XML and saved revisions by page.
Created SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
WARNING: File system is case insensitive - a potential issue.
WARNING: Multiple case variants exist, e.g.
- BOSC 2017 schedule
- BOSC 2017 Schedule
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
- Biojava
- BioJava
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
- Biopython
- BioPython
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
- Gsoc
- GSoC
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
- Moby
- MOBY
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
============================================================
Sorting changes by revision date...
Commit 2005-12-21T16:17:09Z wiki/Main_Page.mediawiki by MediaWiki default
Commit 2005-12-22T04:32:38Z wiki/Main_Page.mediawiki by WikiSysop
...
Commit 2006-01-27T16:47:02Z wiki/MOBY.mediawiki by Jason
error: pathspec 'wiki/MOBY.mediawiki' did not match any file(s) known to git
Return code 1 from git commit
Popen(['git', 'commit', 'wiki/MOBY.mediawiki', '--date', '2006-01-27T16:47:02Z', '--author', 'Jason <anonymous.contributor@example.org>', '-F', '-', '--allow-empty'], ...)
real 0m18.367s
user 0m11.901s
sys 0m3.140s
This macOS machine has a case-insensitive file system, so it can't be used to convert this wiki (the failing commit above is for MOBY, one of the pages that also exists as Moby).
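To check in advance whether a given machine and wiki will hit this, something along these lines could be used. This is a rough sketch and not the script's own check; both function names are hypothetical, and the file system probe simply creates a temporary file and looks for it under a lower-cased name.

```python
import os
import tempfile
from collections import defaultdict


def find_case_variants(titles):
    """Group page titles that differ only by case (e.g. Moby / MOBY)."""
    groups = defaultdict(set)
    for title in titles:
        groups[title.casefold()].add(title)
    # Only groups with more than one spelling are a problem.
    return [sorted(group) for group in groups.values() if len(group) > 1]


def is_case_insensitive(directory="."):
    """Guess whether the file system holding 'directory' ignores case."""
    with tempfile.NamedTemporaryFile(prefix="CaseTest", dir=directory) as handle:
        folder, name = os.path.split(handle.name)
        # On a case-insensitive file system the lower-cased name
        # resolves to the same temporary file.
        return os.path.exists(os.path.join(folder, name.lower()))
```

If find_case_variants reports any groups and is_case_insensitive returns True for the target directory, the conversion would be expected to fail the same way as above, so it is better run on a case-sensitive file system.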
Hurrah, seems to be working now - I'm a little disappointed in past me that this wasn't left in a working state - probably distracted by final tweaks to the wiki I was converting.
On a Linux machine from a clone of this repository as of the v2.0.2 tag:
$ git init testing
Initialized empty Git repository in /XXX/repositories/mediawiki_to_git_md/testing/.git/
$ cd testing/
$ ln -s ../obf_mediawiki_dump_2024-02-05.xml.bz2
$ time ../xml_to_git.py -i obf_mediawiki_dump_2024-02-05.xml.bz2
WARNING - running without username to GitHub mapping
WARNING - running without username ignore list
Creating SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
============================================================
Parsing XML and saving revisions by page.
Finished parsing XML and saved revisions by page.
Created SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
============================================================
Sorting changes by revision date...
Commit 2005-12-21T16:17:09Z wiki/Main_Page.mediawiki by MediaWiki default
Commit 2005-12-22T04:32:38Z wiki/Main_Page.mediawiki by WikiSysop
...
This takes a while to finish, hopefully my session doesn't time out ;)
And it finished:
...
Commit 2020-05-07T10:04:49Z wiki/BOSC_2019.mediawiki by Peter
============================================================
Missing information for these usernames:
1 - Aaron
8 - Aarrigoni
1 - Adamwalsh
2 - Aganapathy
10 - Akinjo
1 - Alicia19
2 - Alysia
2 - Alysia101
1 - Amackey
10 - Andreas
3 - Andrey Kislyuk
9 - Andy.Jenkinson
3 - Anurag Priyam
8 - Apeltzer
2 - Artem Tarasov
4 - Ashish
1 - Asprakash
2 - Atsuko
3 - Bastian Greshake
1 - Biorescuer
1 - Bmpvieira
1 - BocgeTbota
1 - Boconnor
1 - Brian
1 - Briandoconnor
1 - Brutos
1 - Bsanders
12 - Caral
1 - Carole Goble
1 - Chapman
217 - Chapmanb
4 - Chris Dagdigian
1 - Chrisdag
2 - Christian Höner zu Siederdissen
80 - Cjfields
4 - Cjm
1 - Claus
4 - Clayrat
1 - Clayton Wheeler
31 - Clements
11 - Cmzmasek
79 - Dag
6 - Dalke
1 - Dan Bolser
1 - Dasmoth
3 - Dave Messina
1 - DebbieCole
2 - Demver5
2 - Derobins
1 - Dfx27
231 - Dlondon
1 - Domibel
72 - EricTalevich
1 - Favrin
1 - FdtRy8
7 - Francesco Strozzi
2 - Gotgenes
1 - GpwJq3
2 - GreggHelt2
10 - HLWiencko
1 - Heikki
3 - Helios
2 - Hervé Ménager
4 - Heuermh
11 - Hrhotz
20 - Idoerg
11 - James Malone
8 - Jandot
183 - Jason
1 - Jason T Bulmer
1 - Jflatow
1 - Jgao
6 - Jhannah
1 - Jim Vallandingham
2 - Jlhoyd05
1 - Joaor
1 - Jxtx
16 - K
20 - Karsten Hokamp
1 - Kbegley
677 - Kdahlquist
4 - Ketil
1 - Klortho
1 - Konrad Foerstner
322 - Lapp
1 - Lena
1 - Lisunov
1 - LmeIcp
4 - Lsegal
5 - Lstein
174 - Maintenance script
8 - Majensen
40 - Manolin
1 - Marie101
3 - Mark
1 - Martha
1 - Martin
13 - Matus Kalas
9 - Mauricio
1 - Maxx
2 - MediaWiki default
1 - Meg Staton
1 - Michael Crusoe
8 - Michael Heuer
5 - Mmreich
3 - Mn3
2 - Molecules
3 - Mroos
1 - Mukh17
1 - Nathanhaigh
2 - Ngoto
1 - Nilx
496 - Nlharris
4 - Nmatzke
824 - Nomi
2 - Ohofmann
1 - Pablo Pareja Tobes
1 - Paola mybhaby
5 - Pditommaso
602 - Peter
1 - Peterc
17 - Peterrice
1 - Pieter
2 - Piyenk
110 - Pjotrp
2 - Ptroshin
51 - Raoul Jean Pierre Bonnal
1 - Reece
1 - Rholland
1 - Rice
3 - Ricej
1 - RiederZelma
2 - Rmounce
1 - Robbar
3 - Robert Haines
98 - RobertBuels
1 - Rogerhall
4 - Rvalls
2 - Scain
3 - Scottcain
1 - Seakean001
5 - Sebastien Mondet
1 - Senger
1 - Serdar
1 - Sergi
1 - Skwsm
5 - Smoe
5 - Spencer Bliven
11 - Stain
22 - SteffenMoeller
3 - SteveC
1 - Stian Soiland-Reyes
1 - Strubinf
1 - Susanna Sansone
2 - Takashi23
4 - Tbooth
1 - Thomas Down
1 - Tiago
1 - TiffanyBurns
2 - Toby Hocking
2 - Trac1e
4 - Uludag
1 - Uzman
1 - Verlyn
1 - Vgopalan
7 - Welch
8 - WikiSysop
8 - Yayamamo
3 - Yinjun111
1 - Yohell
There are 0 unwanted commits from blocked users.
Done
real 35m59.393s
user 0m44.882s
sys 2m36.417s
The run time will depend on the file system speed.
I won't reveal all the email addresses I matched, but as an example I personally had ended up with two accounts on the wiki (other people had too):
$ grep -i cock usernames.txt
Peterc Peter J. A. Cock <p.j.a.cock@googlemail.com>
Peter Peter J. A. Cock <p.j.a.cock@googlemail.com>
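As an illustration of how a mapping like this can be read, here is a small sketch. The exact file format expected by xml_to_git.py is not shown in this thread, so this assumes one entry per line with a tab between the wiki username and the author string; the function name is hypothetical.

```python
def load_username_mapping(lines):
    """Parse 'wiki username<TAB>Name <email>' lines into a dict.

    Assumes tab-separated entries; this is an illustrative guess at
    the format, not the script's actual parser.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        username, sep, author = line.partition("\t")
        if not sep or not author.strip():
            raise ValueError(f"Expected a tab-separated entry, got: {line!r}")
        mapping[username.strip()] = author.strip()
    return mapping
```

Two wiki accounts belonging to the same person, as in the grep output above, would simply become two keys pointing at the same author string, so their commits end up attributed to one Git identity.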
Hi,
This is somehow a follow-up on #38, the wish to see mediawiki_to_md.py working.
I have #39 applied, and I'm using python3.
I did execute these commands:
I got this output:
How do I get beyond the "ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml"?
For what it is worth: