mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0
89 stars, 14 forks

Make `--index --images` (index.php only) work on newer versions of MediaWiki and on some non-English wikis #156

Closed: yzqzss closed 1 year ago

yzqzss commented 1 year ago

The key commit is: https://github.com/mediawiki-client-tools/mediawiki-scraper/pull/156/commits/dce334b3ee68df1d9cd5a1155814f17f4221255e

yzqzss commented 1 year ago

Known issue: Newer versions of MediaWiki seem to have changed how Special:Filelist handles the offset parameter. The trick (offset = "29990101000000") that we use to traverse Special:Filelist in reverse no longer works.

Fetching images via index.php is only a fallback for when the API is unavailable, and our limit is 5000, so this bug should only affect wikis that have no usable API and host more than 5000 media files.
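To make the affected case concrete, here is a minimal sketch of the request the fallback sends and of the condition described above. The helper names `filelist_url` and `is_affected` are hypothetical and are not taken from the wikiteam3 code; the sketch assumes the fallback queries index.php with `title`, `limit`, and `offset` parameters.

```python
from urllib.parse import urlencode


def filelist_url(index_url, limit=5000, offset="29990101000000"):
    """Build a Special:Filelist query URL (hypothetical helper).

    The far-future default offset is the reverse-traversal trick from
    this thread; per the known issue it may not work on newer MediaWiki.
    """
    query = urlencode({"title": "Special:Filelist", "limit": limit, "offset": offset})
    return f"{index_url}?{query}"


def is_affected(api_available, media_count, limit=5000):
    """Return True when a wiki would hit the offset bug (hypothetical helper).

    Only wikis without a usable API fall back to index.php, and only
    those with more files than one page holds need a second request.
    """
    return (not api_available) and media_count > limit


# Example with a placeholder URL (not a real wiki).
url = filelist_url("https://wiki.example.org/index.php", limit=1)
```

With `limit` lowered to 1, any wiki holding two or more files needs a second page, which is why a small limit makes the bug easy to trigger.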

How to reproduce:

Check out this PR and set the limit parameter to 1 (here: https://github.com/elsiehupp/wikiteam3/blob/eb1529a4c18ec3d71485aea3351330f6a52cdae7/wikiteam3/dumpgenerator/dump/image/image.py#L210)

dumpgenerator --index <index.php URL> --images

NOTE:

There is no problem with this PR itself, and it can be merged normally.

robkam commented 1 year ago

https://asoiaf.fandom.com/ gets "We couldn't find an English wiki at this URL, but here are related wikis in other languages" and so on.