mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Dumping linux-mips.org: Multiple issues #131

Closed FlyGoat closed 1 year ago

FlyGoat commented 1 year ago

Hi,

I was trying to make a backup dump of linux-mips.org, but I ran into multiple issues with this tool.

The site's SSL certificate has expired, which makes urllib3 unhappy. I was able to work around it with:

diff --git a/wikiteam3/dumpgenerator/cli/cli.py b/wikiteam3/dumpgenerator/cli/cli.py
index 217f2ad..91e9abc 100644
--- a/wikiteam3/dumpgenerator/cli/cli.py
+++ b/wikiteam3/dumpgenerator/cli/cli.py
@@ -162,6 +162,7 @@ def getParameters(params=None) -> Tuple[Config, Dict]:
         print("Using cookies from %s" % args.cookies)
     mod_requests_text(requests)
     session = requests.Session()
+    session.verify = False

     try:
         from requests.adapters import HTTPAdapter

It would be great to have a CLI option to disable certificate verification, but I'm fine with this workaround :-)

Then I ran into:
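A minimal sketch of what such an option could look like. The flag name --insecure matches the one mentioned later in this thread; the function names, the parser, and the stand-in session object are hypothetical and are not dumpgenerator's actual code.

```python
import argparse
from types import SimpleNamespace


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the --insecure flag name comes from this thread."""
    parser = argparse.ArgumentParser(prog="dumpgenerator-sketch")
    parser.add_argument(
        "--insecure",
        action="store_true",
        help="disable TLS certificate verification (only for sites you trust)",
    )
    return parser


def apply_tls_settings(session, insecure: bool) -> None:
    """Turn off certificate verification on a requests-style session if asked."""
    if insecure:
        # Same effect as the one-line patch above: requests.Session honours .verify
        session.verify = False


# Stand-in for requests.Session() so the sketch runs without third-party packages.
session = SimpleNamespace(verify=True)
args = build_parser().parse_args(["--insecure"])
apply_tls_settings(session, args.insecure)
print(session.verify)  # prints False
```

Keeping verification on by default and requiring an explicit flag means an expired or invalid certificate is only ignored when the user asks for it.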

  warnings.warn(
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Trying to save only the last revision for this page...

I checked the API manually with curl and it works fine, so I have no idea why dumpgenerator is unhappy with it.

One last thing: the program terminated with this traceback:

  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/__init__.py", line 26, in main
    DumpGenerator()
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/dump/generator.py", line 115, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/dump/generator.py", line 128, in createNewDump
    generateXMLDump(config=config, session=other["session"])
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/dump/xmldump/xml_dump.py", line 141, in generateXMLDump
    doXMLExportDump(config, session, xmlfile, lastPage)
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/dump/xmldump/xml_dump.py", line 66, in doXMLExportDump
    for title in readTitles(config, session=session, start=start):
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/api/page_titles.py", line 237, in readTitles
    getPageTitles(config=config, session=session)
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/api/page_titles.py", line 199, in getPageTitles
    for title in titles:
  File "/opt/homebrew/lib/python3.11/site-packages/wikiteam3/dumpgenerator/api/page_titles.py", line 29, in getPageTitlesAPI
    for page in site.allpages(namespace=namespace):
  File "/opt/homebrew/lib/python3.11/site-packages/mwclient/listing.py", line 185, in __next__
    return mwclient.page.Page(self.site, u'', info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/mwclient/page.py", line 55, in __init__
    self.protection = {
                      ^
  File "/opt/homebrew/lib/python3.11/site-packages/mwclient/page.py", line 56, in <dictcomp>
    i['type']: (i['level'], i['expiry'])
                            ~^^^^^^^^^^
KeyError: 'expiry'

It looks like an mwclient issue, but I don't know this area well, so I'm posting here to ask for help.

Thanks

yzqzss commented 1 year ago

Temporary fix:

Edit /<your-python-site-packages-path>/mwclient/page.py and remove i['expiry'] on line 56:

- i['type']: (i['level'], i['expiry'])
+ i['type']: (i['level'])

https://github.com/mwclient/mwclient/blob/4217a4ffbf492a84b29c5f4b0fcf390f93de3165/mwclient/page.py#L56

This change is safe as long as no other software on your system relies on mwclient to read the protection expiry of a page. Note that the stored value then becomes a plain level string rather than a (level, expiry) tuple.
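A gentler variant of the same idea, as a sketch: keep the (level, expiry) tuple shape and fall back to None when the wiki omits the expiry key. The function name parse_protection and the sample data are hypothetical; this is neither mwclient's actual code nor necessarily what the upstream fix does.

```python
def parse_protection(info: dict) -> dict:
    """Map protection type -> (level, expiry), tolerating a missing 'expiry'."""
    return {
        entry["type"]: (entry["level"], entry.get("expiry"))
        for entry in info.get("protection", ())
        if "type" in entry and "level" in entry
    }


# Hypothetical page info: the first entry lacks 'expiry', as on the affected wiki.
sample_info = {
    "protection": [
        {"type": "edit", "level": "sysop"},
        {"type": "move", "level": "sysop", "expiry": "infinity"},
    ]
}

protection = parse_protection(sample_info)
print(protection)
```

Using dict.get instead of indexing avoids the KeyError while callers that unpack the tuple keep working.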


Another option is using --xml --xmlrevisions, which does not use mwclient to get titles.


Page that triggers the error (the expiry key is missing from its protection info):

https://www.linux-mips.org/wiki?title=System_Recovery_Status&action=history

Protected "System Recovery Status" ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))

yzqzss commented 1 year ago

Also, the URL rewrite rules on this wiki seem a bit strange. The actual addresses of api.php and index.php are:

https://www.linux-mips.org/mediawiki/api.php
https://www.linux-mips.org/mediawiki/index.php

If dumpgenerator does not detect them correctly, please specify them explicitly.
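To check by hand that an api.php address is the real endpoint, one option is to request the standard siteinfo query and see whether the reply is MediaWiki JSON. A minimal sketch, with hypothetical helper names; it only builds the URL and inspects a response body, and leaves the actual HTTP request to curl or a client of your choice.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.linux-mips.org/mediawiki/api.php"


def siteinfo_url(api_url: str) -> str:
    """Build a MediaWiki siteinfo query URL for the given api.php address."""
    query = urlencode(
        {"action": "query", "meta": "siteinfo", "siprop": "general", "format": "json"}
    )
    return f"{api_url}?{query}"


def looks_like_mediawiki(body: str) -> bool:
    """Return True if the body parses as JSON with a query.general section."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML error page from a wrong URL.
        return False
    query = data.get("query") if isinstance(data, dict) else None
    return isinstance(query, dict) and "general" in query


check_url = siteinfo_url(API_URL)
print(check_url)
```

Because the site's certificate has expired, fetching this URL will fail unless the client is told to skip certificate verification.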

AdamWill commented 1 year ago

https://github.com/mwclient/mwclient/pull/291 addresses the missing expiry key.

FlyGoat commented 1 year ago

I don't know this area well, but I can confirm that --insecure and https://github.com/mwclient/mwclient/pull/291 both work. Thank you all for the help.