peterjc / mediawiki_to_git_md

Convert a MediaWiki export XML file into MarkDown as a series of git commits
MIT License

ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml #40

stappersg closed this issue 3 months ago

stappersg commented 3 months ago

Hi,

This is somewhat of a follow-up to #38: the wish to see mediawiki_to_md.py working.

I have #39 applied, I'm using python3.

I executed these commands:

git init issue40
cd issue40/
wget -O obf_mediawiki_dump_2024-02-05.xml.bz2 https://www.dropbox.com/s/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2?dl=0
bunzip2 obf_mediawiki_dump_2024-02-05.xml.bz2 
ls -lh
head obf_mediawiki_dump_2024-02-05.xml 
../mediawiki_to_md.py obf_mediawiki_dump_2024-02-05.xml 
../mediawiki_to_md.py --input obf_mediawiki_dump_2024-02-05.xml

I got this output:

stappers@laptop:~/src/github/mediawiki_to_git_md
$ git init issue40
Initialized empty Git repository in /home/gs0604/src/github/mediawiki_to_git_md/issue40/.git/
stappers@laptop:~/src/github/mediawiki_to_git_md
$ cd issue40/
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ wget -O obf_mediawiki_dump_2024-02-05.xml.bz2 https://www.dropbox.com/s/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2?dl=0
--2024-07-15 22:19:25--  https://www.dropbox.com/s/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2 [following]
--2024-07-15 22:19:27--  https://www.dropbox.com/s/raw/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com/cd/0/inline/CWzcE3O79ouVHBEP5Wa5jo_nWVRzuXfjkD43_yUNzfQQsdRdSBhiUOWMT9RjwOmOJy9XNHPw7k_0s9YYj910YlWlqqW3fQmlozGpycMaaIv2eSk8Xbot0gfZuKB_uK9q7Lg/file# [following]
--2024-07-15 22:19:27--  https://uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com/cd/0/inline/CWzcE3O79ouVHBEP5Wa5jo_nWVRzuXfjkD43_yUNzfQQsdRdSBhiUOWMT9RjwOmOJy9XNHPw7k_0s9YYj910YlWlqqW3fQmlozGpycMaaIv2eSk8Xbot0gfZuKB_uK9q7Lg/file
Resolving uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com (uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com)... 162.125.65.15, 2620:100:6021:15::a27d:410f
Connecting to uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com (uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com)|162.125.65.15|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /cd/0/inline2/CWw4RiqFmRxq166Jy7YLmZ6hm4uUWtwa4y9eUvLDwUFOIbKGDhxW0H7XRdMClFme5YAHFDkB9vBr33qcxyb5V3k-91VhD7-lSe3eC9up1kRml3roKOfwIdKzj6T8uvu4ULHv4DjD-KhI1VjehSK_YDr6Ol_UttTcENVvayodo5tIOVY4ZVuOqBGTIMfxrCFYcrUvdynwCP8xjy1mjgsUOfnTL1eKvZE1xqfCw0wu723uy6PXYV0g-e5252y7LO6oHyvzJ_4iucbM2YqWEHma2LpgCPgxp77CL5V4WxgfDmGzPYco23IHWAXMBKTRMy6Vl6ckV0wM98qEsQLhMjQ_axp3aGcuOzuS3DqFP38f1l_A3A/file [following]
--2024-07-15 22:19:29--  https://uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com/cd/0/inline2/CWw4RiqFmRxq166Jy7YLmZ6hm4uUWtwa4y9eUvLDwUFOIbKGDhxW0H7XRdMClFme5YAHFDkB9vBr33qcxyb5V3k-91VhD7-lSe3eC9up1kRml3roKOfwIdKzj6T8uvu4ULHv4DjD-KhI1VjehSK_YDr6Ol_UttTcENVvayodo5tIOVY4ZVuOqBGTIMfxrCFYcrUvdynwCP8xjy1mjgsUOfnTL1eKvZE1xqfCw0wu723uy6PXYV0g-e5252y7LO6oHyvzJ_4iucbM2YqWEHma2LpgCPgxp77CL5V4WxgfDmGzPYco23IHWAXMBKTRMy6Vl6ckV0wM98qEsQLhMjQ_axp3aGcuOzuS3DqFP38f1l_A3A/file
Reusing existing connection to uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 126581686 (121M) [application/octet-stream]
Saving to: 'obf_mediawiki_dump_2024-02-05.xml.bz2'

obf_mediawiki_dump_2024-02-05.xml.bz2              100%[=============================================================>] 120.72M  2.25MB/s    in 58s     

2024-07-15 22:20:28 (2.07 MB/s) - 'obf_mediawiki_dump_2024-02-05.xml.bz2' saved [126581686/126581686]

stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ bunzip2 obf_mediawiki_dump_2024-02-05.xml.bz2 
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ ls -lh
totaal 221M
-rw-r--r-- 1 stappers stappers 221M 15 jul 22:20 obf_mediawiki_dump_2024-02-05.xml
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ head obf_mediawiki_dump_2024-02-05.xml 
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Open Bioinformatics Foundation</sitename>
    <dbname>obfiztkb_mw688</dbname>
    <base>https://www.open-bio.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.3</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ ../mediawiki_to_md.py obf_mediawiki_dump_2024-02-05.xml 
usage: mediawiki_to_md.py [-h] -i NAMES [NAMES ...] [-p PREFIX] [--mediawiki-ext EXT] [--markdown-ext EXT]
mediawiki_to_md.py: error: the following arguments are required: -i/--input
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ ../mediawiki_to_md.py --input obf_mediawiki_dump_2024-02-05.xml 
Will be using pandoc 2.17.1.1
ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$

How do I get beyond the ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml?

For what it is worth:

$ python3 --version
Python 3.11.2
$ python3
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
$
peterjc commented 3 months ago

See 54634c086bc794832506d03146dfee44ee981150

The script xml_to_git.py takes an XML dump file, whereas mediawiki_to_md.py takes individual mediawiki files.
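
To tell the two kinds of input apart before choosing a script, a quick check of the XML root element can help. This is only a sketch and not code from this repository: the function name looks_like_mediawiki_dump is made up for illustration, and it assumes an export whose root element is <mediawiki>, as in the head output above.

```python
import xml.etree.ElementTree as ET


def looks_like_mediawiki_dump(source):
    """Return True if source (a path or binary file object) starts with a <mediawiki> root.

    Hypothetical helper for illustration; it is not part of xml_to_git.py
    or mediawiki_to_md.py.
    """
    try:
        # Only the first "start" event is needed: it is the root element.
        for _event, element in ET.iterparse(source, events=("start",)):
            # Drop the "{namespace}" prefix ElementTree puts in front of the tag.
            return element.tag.rsplit("}", 1)[-1] == "mediawiki"
    except ET.ParseError:
        # Plain wiki text (or anything else that is not XML) ends up here.
        return False
    return False
```

If this returns True the file is a candidate for xml_to_git.py; an individual page of wiki markup would return False.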

peterjc commented 3 months ago

Also, there is no need to decompress: the extensions .gz (commonly used) and .bz2 (typical of Wikipedia exports) are handled automatically.
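
For anyone curious how extension-based handling like this can work, here is a minimal sketch using only the Python standard library. It illustrates the general idea and is not necessarily how xml_to_git.py does it; the function name open_dump is an assumption.

```python
import bz2
import gzip


def open_dump(path):
    """Open a (possibly compressed) XML dump for binary reading.

    Illustrative only: the compression format is chosen from the
    file extension, nothing more.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    # No recognised compression extension, so treat it as plain XML.
    return open(path, "rb")
```

The returned handle can be passed straight to an XML parser, so the 221M uncompressed file never needs to exist on disk.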

peterjc commented 3 months ago

Seems to behave as expected now on macOS as of v2.0.2:

$ git init testing
$ cd testing/
$ ln -s ~/Downloads/obf_mediawiki_dump_2024-02-05.xml.bz2
$ time ../xml_to_git.py -i obf_mediawiki_dump_2024-02-05.xml.bz2
WARNING - running without username to GitHub mapping
WARNING - running without username ignore list
Creating SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
============================================================
Parsing XML and saving revisions by page.
Finished parsing XML and saved revisions by page.
Created SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
WARNING: File system is case insensitive - a potential issue.
WARNING: Multiple case variants exist, e.g.
 - BOSC 2017 schedule
 - BOSC 2017 Schedule
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
 - Biojava
 - BioJava
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
 - Biopython
 - BioPython
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
 - Gsoc
 - GSoC
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
WARNING: Multiple case variants exist, e.g.
 - Moby
 - MOBY
If your file system cannot support such filenames at the same time
(e.g. Windows, or default Mac OS X) this conversion will FAIL.
============================================================
Sorting changes by revision date...
Commit 2005-12-21T16:17:09Z wiki/Main_Page.mediawiki by MediaWiki default
Commit 2005-12-22T04:32:38Z wiki/Main_Page.mediawiki by WikiSysop
...
Commit 2006-01-27T16:47:02Z wiki/MOBY.mediawiki by Jason
error: pathspec 'wiki/MOBY.mediawiki' did not match any file(s) known to git
Return code 1 from git commit
Popen(['git', 'commit', 'wiki/MOBY.mediawiki', '--date', '2006-01-27T16:47:02Z', '--author', 'Jason <anonymous.contributor@example.org>', '-F', '-', '--allow-empty'], ...)

real    0m18.367s
user    0m11.901s
sys     0m3.140s

This macOS machine has a case-insensitive file system, so it can't be used to convert this wiki.
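
The kind of check behind those warnings can be sketched as follows: group the page titles by their case-folded form and report any group with more than one spelling. This is an illustration only, not the script's actual code, and case_variants is a made-up name.

```python
from collections import defaultdict


def case_variants(titles):
    """Return groups of titles that differ only by letter case.

    Hypothetical helper: each group would collide as filenames on a
    case-insensitive file system.
    """
    groups = defaultdict(set)
    for title in titles:
        groups[title.casefold()].add(title)
    # Keep only the groups that contain more than one distinct spelling.
    return [sorted(group) for group in groups.values() if len(group) > 1]


for group in case_variants(["Biojava", "BioJava", "Moby", "MOBY", "Main Page"]):
    print("Multiple case variants exist:", ", ".join(group))
```

On Linux (or any case-sensitive file system) such pairs can coexist as separate files, which is why the same dump gets past this point there.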

peterjc commented 3 months ago

Hurrah, it seems to be working now. I'm a little disappointed in past me that this wasn't left in a working state; I was probably distracted by final tweaks to the wiki I was converting.

On a Linux machine, from a clone of this repository as of the v2.0.2 tag:

$ git init testing
Initialized empty Git repository in /XXX/repositories/mediawiki_to_git_md/testing/.git/
$ cd testing/
$ ln -s ../obf_mediawiki_dump_2024-02-05.xml.bz2
$ time ../xml_to_git.py -i obf_mediawiki_dump_2024-02-05.xml.bz2
WARNING - running without username to GitHub mapping
WARNING - running without username ignore list
Creating SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
============================================================
Parsing XML and saving revisions by page.
Finished parsing XML and saved revisions by page.
Created SQLite file obf_mediawiki_dump_2024-02-05.xml.bz2.sqlite
============================================================
Sorting changes by revision date...
Commit 2005-12-21T16:17:09Z wiki/Main_Page.mediawiki by MediaWiki default
Commit 2005-12-22T04:32:38Z wiki/Main_Page.mediawiki by WikiSysop
...

This takes a while to finish, hopefully my session doesn't time out ;)

peterjc commented 3 months ago

And it finished:

...
Commit 2020-05-07T10:04:49Z wiki/BOSC_2019.mediawiki by Peter
============================================================
Missing information for these usernames:
1 - Aaron
8 - Aarrigoni
1 - Adamwalsh
2 - Aganapathy
10 - Akinjo
1 - Alicia19
2 - Alysia
2 - Alysia101
1 - Amackey
10 - Andreas
3 - Andrey Kislyuk
9 - Andy.Jenkinson
3 - Anurag Priyam
8 - Apeltzer
2 - Artem Tarasov
4 - Ashish
1 - Asprakash
2 - Atsuko
3 - Bastian Greshake
1 - Biorescuer
1 - Bmpvieira
1 - BocgeTbota
1 - Boconnor
1 - Brian
1 - Briandoconnor
1 - Brutos
1 - Bsanders
12 - Caral
1 - Carole Goble
1 - Chapman
217 - Chapmanb
4 - Chris Dagdigian
1 - Chrisdag
2 - Christian Höner zu Siederdissen
80 - Cjfields
4 - Cjm
1 - Claus
4 - Clayrat
1 - Clayton Wheeler
31 - Clements
11 - Cmzmasek
79 - Dag
6 - Dalke
1 - Dan Bolser
1 - Dasmoth
3 - Dave Messina
1 - DebbieCole
2 - Demver5
2 - Derobins
1 - Dfx27
231 - Dlondon
1 - Domibel
72 - EricTalevich
1 - Favrin
1 - FdtRy8
7 - Francesco Strozzi
2 - Gotgenes
1 - GpwJq3
2 - GreggHelt2
10 - HLWiencko
1 - Heikki
3 - Helios
2 - Hervé Ménager
4 - Heuermh
11 - Hrhotz
20 - Idoerg
11 - James Malone
8 - Jandot
183 - Jason
1 - Jason T Bulmer
1 - Jflatow
1 - Jgao
6 - Jhannah
1 - Jim Vallandingham
2 - Jlhoyd05
1 - Joaor
1 - Jxtx
16 - K
20 - Karsten Hokamp
1 - Kbegley
677 - Kdahlquist
4 - Ketil
1 - Klortho
1 - Konrad Foerstner
322 - Lapp
1 - Lena
1 - Lisunov
1 - LmeIcp
4 - Lsegal
5 - Lstein
174 - Maintenance script
8 - Majensen
40 - Manolin
1 - Marie101
3 - Mark
1 - Martha
1 - Martin
13 - Matus Kalas
9 - Mauricio
1 - Maxx
2 - MediaWiki default
1 - Meg Staton
1 - Michael Crusoe
8 - Michael Heuer
5 - Mmreich
3 - Mn3
2 - Molecules
3 - Mroos
1 - Mukh17
1 - Nathanhaigh
2 - Ngoto
1 - Nilx
496 - Nlharris
4 - Nmatzke
824 - Nomi
2 - Ohofmann
1 - Pablo Pareja Tobes
1 - Paola mybhaby
5 - Pditommaso
602 - Peter
1 - Peterc
17 - Peterrice
1 - Pieter
2 - Piyenk
110 - Pjotrp
2 - Ptroshin
51 - Raoul Jean Pierre Bonnal
1 - Reece
1 - Rholland
1 - Rice
3 - Ricej
1 - RiederZelma
2 - Rmounce
1 - Robbar
3 - Robert Haines
98 - RobertBuels
1 - Rogerhall
4 - Rvalls
2 - Scain
3 - Scottcain
1 - Seakean001
5 - Sebastien Mondet
1 - Senger
1 - Serdar
1 - Sergi
1 - Skwsm
5 - Smoe
5 - Spencer Bliven
11 - Stain
22 - SteffenMoeller
3 - SteveC
1 - Stian Soiland-Reyes
1 - Strubinf
1 - Susanna Sansone
2 - Takashi23
4 - Tbooth
1 - Thomas Down
1 - Tiago
1 - TiffanyBurns
2 - Toby Hocking
2 - Trac1e
4 - Uludag
1 - Uzman
1 - Verlyn
1 - Vgopalan
7 - Welch
8 - WikiSysop
8 - Yayamamo
3 - Yinjun111
1 - Yohell
There are 0 unwanted commits from blocked users.
Done

real    35m59.393s
user    0m44.882s
sys     2m36.417s

The run time will depend on the file system speed.

I won't reveal all the email addresses I matched, but as an example, I had personally ended up with two accounts on the wiki (as had other people):

$ grep -i cock usernames.txt
Peterc  Peter J. A. Cock <p.j.a.cock@googlemail.com>
Peter   Peter J. A. Cock <p.j.a.cock@googlemail.com>
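
To spot such duplicate accounts in a mapping file, something like the sketch below could be used. It assumes each line of usernames.txt is a wiki username, a tab, then the git author string; that layout is inferred from the grep output above rather than confirmed here, and both function names are made up for illustration.

```python
from collections import defaultdict


def parse_user_mapping(lines):
    """Map wiki username -> git author string.

    Assumes tab-separated 'username<TAB>Name <email>' lines (an assumption,
    not a documented format); lines without a tab are skipped.
    """
    mapping = {}
    for line in lines:
        username, tab, author = line.rstrip("\n").partition("\t")
        if tab and username.strip() and author.strip():
            mapping[username.strip()] = author.strip()
    return mapping


def accounts_by_author(mapping):
    """Return the authors that have more than one wiki username."""
    accounts = defaultdict(list)
    for username, author in mapping.items():
        accounts[author].append(username)
    return {author: sorted(names) for author, names in accounts.items() if len(names) > 1}
```

Usernames that are absent from the mapping would still be listed under "Missing information for these usernames" at the end of a run, as in the output above.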