mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Not resuming images downloading after errored out #15

Closed rickyzhangca closed 1 year ago

rickyzhangca commented 2 years ago

I am trying to create a dump for a Fandom wiki. The download errored out after a while. When I try to resume it, the script starts downloading the images from the beginning again and reports that no images from the previous session were found.

Error:

...
    Downloaded 6500 images
    Downloaded 6510 images
    Downloaded 6520 images
    Downloaded 6530 images
    Downloaded 6540 images
HTTP Error 503.
Server error, max retries exceeded.
Please resume the dump later.
https://hitman.fandom.com/index.php?title=Special%3AExport&pages=Image%3AHitmanbm12.jpg&action=submit&curonly=1&limit=1

Resuming with:

dumpgenerator https://hitman.fandom.com/wiki/Apex_Predator --xml --curonly --images --resume --path=C:/Users/[my name]/Downloads/wikiteam3-python3/hitmanfandomcom-20220714-wikidump

Gives

...
Analysing https://hitman.fandom.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
0 images were found in the directory from a previous session
Retrieving images from "start"
...

The images from the previous session do exist in the folder.
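As a quick sanity check on the "0 images were found" message, the files on disk can be counted independently of the tool. The snippet below is only a sketch: the helper name `count_downloaded_images` and the `images` subdirectory name are assumptions for illustration, not the project's actual resume logic, so adjust the subdirectory to match your dump folder.

```python
from pathlib import Path


def count_downloaded_images(dump_dir: str, subdir: str = "images") -> int:
    """Count regular files in a dump's image subdirectory.

    Hypothetical helper: `subdir` is an assumed folder name, not
    something confirmed by the thread.
    """
    image_dir = Path(dump_dir) / subdir
    if not image_dir.is_dir():
        # No image folder at all, so nothing to count.
        return 0
    # Count files only; nested directories are ignored.
    return sum(1 for entry in image_dir.iterdir() if entry.is_file())
```

If this reports several thousand files while the resume log reports 0, the files are present and the mismatch is in how the resume step looks for them.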

elsiehupp commented 2 years ago

Problems resuming past dumps are unfortunately a little bit of a known issue (though we thought we had fixed it).

The main workaround at this point would probably be re-running the dump from the start with the parameter --delay=0.5. That adds a 0.5-second delay between calls (you can choose a different value) in order to avoid getting timed out.
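To illustrate why a per-request delay helps with HTTP 503 errors, here is a minimal sketch of a polite download loop. It is not the dumpgenerator implementation: the function name `fetch_all_with_delay`, the `fetch` callable (which returns a status code and a body), and the doubling back-off on 503 are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, List, Tuple


def fetch_all_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    delay: float = 0.5,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    Hypothetical sketch: `fetch` returns (status_code, body). A 503 is
    retried with a longer pause each time, up to `max_retries` attempts.
    """
    results = []
    for url in urls:
        for attempt in range(1, max_retries + 1):
            status, body = fetch(url)
            if status != 503:
                results.append(body)
                break
            # Server is overloaded: wait longer before each retry.
            sleep(delay * 2 ** attempt)
        else:
            # Every attempt returned 503.
            raise RuntimeError(f"max retries exceeded for {url}")
        # Fixed pause between successive requests.
        sleep(delay)
    return results
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting; in normal use the default `time.sleep` applies.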

In the meantime, apologies for the inconvenience, and thank you for bringing this to our attention!

rickyzhangca commented 2 years ago

No worries! I just decided to bring it up in case it signals some other underlying issue.

elsiehupp commented 2 years ago

I changed the default delay from 0s to 0.5s, which should help mitigate the problem for other users. With regard to fixing the underlying bug, though, I have a half-complete drastic rewrite of the entire project, and that has its own issues, lol.