mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Handle missing files with `sha1 == False`. #114

Closed yzqzss closed 1 year ago

yzqzss commented 1 year ago

Example:

https://wiki.huihoo.com/api.php?action=query&list=allimages&aiprop=url|user|size|sha1&aifrom=Freebsdbasedhosting.jpg&format=json

{
    "name": "Freebsdbasedhosting.jpg",
    "user": "Allen",
    "size": 90830,
    "width": 700,
    "height": 600,
    "url": "http://wiki.huihoo.com/images/e/e8/Freebsdbasedhosting.jpg",
    "descriptionurl": "http://wiki.huihoo.com/wiki/%E6%96%87%E4%BB%B6:Freebsdbasedhosting.jpg",
    "sha1": false,
    "ns": 6,
    "title": "文件:Freebsdbasedhosting.jpg"
},
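For illustration, a minimal sketch of how such entries could be handled: split the `allimages` result into usable entries and missing-file entries instead of assuming `sha1` is always a hex digest. The function name and the exact skip behaviour are assumptions for this sketch, not the actual patch in this PR.

```python
def filter_valid_images(allimages):
    """Split a list=allimages result list into (usable, missing) entries.

    Some broken wikis report the JSON literal false for "sha1" when the
    database row exists but the file itself is gone from disk; this
    helper (hypothetical, not dumpgenerator's real code) skips those.
    """
    usable, missing = [], []
    for img in allimages:
        # A healthy entry carries a non-empty hex digest; treat False,
        # None, or an empty string as a missing file.
        if img.get("sha1") in (False, None, ""):
            missing.append(img["name"])
        else:
            usable.append(img)
    return usable, missing
```

The missing-file names can then be logged or recorded separately rather than aborting the whole image dump.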
robkam commented 1 year ago

After the merge:

$ dumpgenerator   --xml --xmlrevisions --images --api https://wiki.huihoo.com/api.php
Checking API... https://wiki.huihoo.com/api.php
API is OK: https://wiki.huihoo.com/api.php
Checking index.php... https://wiki.huihoo.com/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./wikihuihoocom-20230118-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.

Analysing https://wiki.huihoo.com/api.php

Warning!: "./wikihuihoocom-20230118-wikidump" path exists
There is a dump in "./wikihuihoocom-20230118-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./wikihuihoocom-20230118-wikidump/config.json" will be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
XML is corrupt? Regenerating...
https://wiki.huihoo.com/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

16 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 113, in __init__
    DumpGenerator.resumePreviousDump(config=config, other=other)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 190, in resumePreviousDump
    generateXMLDump(config=config, session=other["session"])
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 137, in generateXMLDump
    doXMLRevisionDump(config, session, xmlfile, lastPage, useAllrevisions=True)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 25, in doXMLRevisionDump
    for xml in getXMLRevisions(config=config, session=session, lastPage=lastPage, useAllrevision=useAllrevisions):
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlrev\xml_revisions.py", line 80, in getXMLRevisionsByAllRevisions
    for page in arvrequest["query"]["allrevisions"]:
                ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'allrevisions'
yzqzss commented 1 year ago

API:Allrevisions requires MediaWiki 1.27+.
This wiki is running 1.19.2.
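Since `list=allrevisions` only exists on MediaWiki 1.27+, one way to avoid the `KeyError` above would be to compare the wiki's version (from `meta=siteinfo`, whose `general.generator` field reads e.g. "MediaWiki 1.19.2") before issuing the request. This is a simplified sketch under that assumption; the helper name is hypothetical and dumpgenerator's real check may differ.

```python
def mediawiki_at_least(generator, wanted=(1, 27)):
    """Return True if a siteinfo 'generator' string such as
    'MediaWiki 1.19.2' names a version at or above `wanted`.

    Hypothetical helper: parses only the major.minor components,
    which is enough to gate use of list=allrevisions (1.27+).
    """
    version = generator.split()[-1]          # "1.19.2"
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts >= wanted
```

A caller could then fall back to per-page `prop=revisions` exports when this returns False, instead of crashing on the missing `allrevisions` key.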

robkam commented 1 year ago

--xmlrevisions downloads all revisions from an API generator, MediaWiki 1.27+ only. Okay.