mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

truncated API response for "allrevisions" causes infinite loop #166

Closed makoshark closed 2 months ago

makoshark commented 1 year ago

Description of the issue

If an API response would exceed a size threshold and you are using the API:Allrevisions API, MediaWiki truncates the result and does not include any revision data or metadata (i.e., it simply returns an empty list of revisions). This seems to be a special case of [this MediaWiki bug](https://phabricator.wikimedia.org/T86611), although I've not seen any reference to this particular case online.

In at least the version of MediaWiki I'm looking at, it still returns a status of 200. Here is an example of a request that I managed to extract from dumpgenerator: https://wikitravel.org/wiki/en/api.php?list=allrevisions&arvlimit=1&arvdir=newer&arvcontinue=20210219051441|2674861&arvprop=ids|timestamp|user|userid|size|sha1|contentmodel|comment|content|flags&continue=&meta=userinfo&uiprop=blockinfo|hasmsg&action=query&format=json

The revision content seems to be about 8.5MB, a bit more than the API limit of 8,388,608 bytes reported in the warning below. It looks like a spam edit.

This is the JSON version of the API response:

{
  "batchcomplete": "",
  "continue": {
    "arvcontinue": "20210219051441|2674861",
    "continue": "-||userinfo"
  },
  "warnings": {
    "result": {
      "*": "This result was truncated because it would otherwise be larger than the limit of 8,388,608 bytes."
    },
    "main": {
      "*": "Subscribe to the mediawiki-api-announce mailing list at  for notice of API deprecations and breaking changes."
    },
    "allrevisions": {
      "*": "Because \"arvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."
    }
  },
  "query": {
    "allrevisions": [],
    "userinfo": {
      "id": 45359,
      "name": "Benjamin Mako Hill"
    }
  }
}

Because the API has not returned any revisions, the arvcontinue value included in the response does not change. As a result, dumpgenerator.py assumes that all is well, stores the data returned (i.e., nothing), and makes the same request again. This repeats over and over until a user intervenes.

How to fix this

At a minimum, I think it should notice that we're seeing the same continuation across consecutive "successful" (200) requests and then error out. Maybe we want to add the stipulation that the returned data is empty?
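To make the check described above concrete, here is a minimal sketch, not the tool's actual code. `fetch` is a hypothetical callable standing in for whatever dumpgenerator uses to issue the request, and `StalledContinuationError`, `iter_allrevisions`, and `max_repeats` are names invented for illustration. It errors out when a response has an empty `allrevisions` list and hands back the same `arvcontinue` it was given.

```python
from typing import Callable, Dict, Iterator, Optional


class StalledContinuationError(RuntimeError):
    """Raised when the API keeps returning the same continuation with no data."""


def iter_allrevisions(
    fetch: Callable[[Optional[str]], Dict],
    max_repeats: int = 1,
) -> Iterator[Dict]:
    """Yield entries from successive list=allrevisions responses.

    `fetch` (hypothetical) takes the current arvcontinue value, or None for
    the first request, and returns the decoded JSON response as a dict.
    """
    arvcontinue: Optional[str] = None
    repeats = 0
    while True:
        data = fetch(arvcontinue)
        pages = data.get("query", {}).get("allrevisions", [])
        yield from pages

        next_continue = data.get("continue", {}).get("arvcontinue")
        if next_continue is None:
            return  # no further batches for this list

        # An empty batch whose continuation did not advance means the next
        # request would be byte-for-byte identical to this one.
        if not pages and next_continue == arvcontinue:
            repeats += 1
            if repeats >= max_repeats:
                raise StalledContinuationError(
                    f"arvcontinue {next_continue!r} repeated with no revisions returned"
                )
        else:
            repeats = 0
        arvcontinue = next_continue
```

With the default `max_repeats=1` this stops on the first stalled response. A larger value would allow a few retries in case the empty result is transient.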

A bolder approach would involve munging the continuation, e.g. by adding one to it or something similar. I can imagine why we might not want to support this in the tool, though. This appears to work in the specific case above but might not work in general.
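As a rough illustration of what "adding one" could mean, the sketch below assumes the continuation has the `timestamp|revid` shape seen in the example request and increments the trailing number. MediaWiki treats continuation values as opaque, so this is an assumption and not documented behaviour, and `bump_arvcontinue` is a hypothetical helper name. Note that it skips the oversized revision entirely, so the dump would be missing it.

```python
def bump_arvcontinue(arvcontinue: str) -> str:
    """Return the continuation with its trailing numeric field incremented.

    Assumes the value looks like "<timestamp>|<revision id>", as in the
    example request in this issue. Raises ValueError on any other shape.
    """
    timestamp, rev_id = arvcontinue.rsplit("|", 1)
    return f"{timestamp}|{int(rev_id) + 1}"


print(bump_arvcontinue("20210219051441|2674861"))
# prints 20210219051441|2674862
```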

I worked around it by hand: I removed "content" from the arvprop parameter (for just this single request), handcrafted XML for that single <page>, concatenated it to the results, and then restarted. Doing something like this automatically is definitely possible, but I'm not sure it's either worth it or a good idea.
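The first step of that workaround, retrying the single request without the revision text, could be sketched as follows. `without_content` is a hypothetical helper, and the parameter dict simply mirrors the query string of the example request. The retry then returns metadata only, so the revision text would still be missing from the dump.

```python
from typing import Dict


def without_content(params: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the request parameters with "content" dropped from arvprop."""
    props = [p for p in params.get("arvprop", "").split("|") if p and p != "content"]
    retry = dict(params)  # leave the caller's dict untouched
    retry["arvprop"] = "|".join(props)
    return retry


params = {
    "action": "query",
    "list": "allrevisions",
    "arvlimit": "1",
    "arvdir": "newer",
    "arvcontinue": "20210219051441|2674861",
    "arvprop": "ids|timestamp|user|userid|size|sha1|contentmodel|comment|content|flags",
    "format": "json",
}
retry_params = without_content(params)
```

Only `arvprop` changes. The continuation and every other parameter are passed through as they were.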

I'm happy to help code up a fix, but I'm honestly not sure what the best approach would be.

robkam commented 2 months ago

Closing this. If it's still an issue, please reopen.