maflcko / wiki-java-tools

Collection of tools for MediaWiki in Java

Imker aborts whole category download whenever a single download fails #26

Open nicolas-raoul opened 6 years ago

nicolas-raoul commented 6 years ago

Version: v16.09.13

Stack trace:

java.lang.UnknownError: MW API error. Server response was: <?xml version="1.0"?><api servedby="mw2283"><error code="maxlag" info="Waiting for 10.192.32.167: 3.3404757976532 seconds lagged." host="10.192.32.167" lag="3.3404757976532" type="db" xml:space="preserve">See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.</error></api>

    at wiki.Wiki.fetch(Unknown Source)
    at wiki.Wiki.getImage(Unknown Source)
    at wiki.Wiki.getImage(Unknown Source)
    at app.ImkerBase$1.fetch(Unknown Source)
    at app.App.attemptFetch(Unknown Source)
    at app.ImkerBase.downloadLoop(Unknown Source)
    at app.ImkerGUI$4.doInBackground(Unknown Source)
    at app.ImkerGUI$4.doInBackground(Unknown Source)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:295)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:334)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:844)

That happened after 1117 files (out of many more) had been downloaded. A lag of about 3 seconds does not sound like a problem serious enough to require aborting the whole category download.
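One possible way to avoid aborting the whole batch is to retry the failing file a few times with a growing pause and, if it still fails, to record it and carry on with the next one. The sketch below is only an illustration under assumptions: `ResilientDownload`, `Fetcher`, and `downloadAll` are hypothetical names, not part of Imker or wiki-java-tools, and it catches `IOException` although the trace above shows `java.lang.UnknownError`, so a real fix would have to catch whatever `wiki.Wiki.fetch` actually throws.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from Imker's source. */
final class ResilientDownload {

    /** Stand-in for whatever fetches one file; may fail transiently (e.g. on maxlag). */
    interface Fetcher {
        void fetch(String fileName) throws IOException;
    }

    /**
     * Tries every file up to maxAttempts times, doubling the pause after each
     * failure, and returns the names that still failed instead of aborting.
     */
    static List<String> downloadAll(List<String> fileNames, Fetcher fetcher,
                                    int maxAttempts, long initialDelayMillis) {
        List<String> failed = new ArrayList<>();
        for (String name : fileNames) {
            long delay = initialDelayMillis;
            boolean done = false;
            for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
                try {
                    fetcher.fetch(name);
                    done = true;
                } catch (IOException e) {
                    // Pause only if another attempt follows; give up on this
                    // file early if the thread was interrupted while waiting.
                    if (attempt < maxAttempts && !pause(delay)) {
                        break;
                    }
                    delay *= 2; // simple exponential backoff
                }
            }
            if (!done) {
                failed.add(name); // remember the file and move on to the next one
            }
        }
        return failed;
    }

    /** Sleeps for the given time; returns false if the thread was interrupted. */
    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
    }
}
```

The returned list could be shown to the user at the end of the run, so that a handful of lagged requests costs a few retries rather than the whole category.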

luc7v commented 6 years ago

I have this problem often.

nbehrnd commented 4 years ago

Using imker-gui.jar (also version 16.09.13) for the first time, I can confirm this observation. To ease replication and bug fixing, this was my procedure:

    at wiki.Wiki.fetch(Unknown Source)
    at wiki.Wiki.getImage(Unknown Source)
    at wiki.Wiki.getImage(Unknown Source)
    at app.ImkerBase$1.fetch(Unknown Source)
    at app.App.attemptFetch(Unknown Source)
    at app.ImkerBase.downloadLoop(Unknown Source)
    at app.ImkerGUI$4.doInBackground(Unknown Source)
    at app.ImkerGUI$4.doInBackground(Unknown Source)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

Thus, a feature suggestion: let Imker write a persistent list of the files to download, which a) the program may use to resume if, for whatever reason, the batch was not completed, and which b) the user may explicitly point Imker to later, e.g. a quarter of a year on, to collect only the media added to the same category since the last survey, lowering the traffic necessary.

Added: Such a listing may be created with Wikimedia's own list generator (it can even be split into multiple files). Character encoding (e.g., umlauts) may occasionally be an issue in that listing, though; the files Imker downloaded did not show this problem.
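To make the suggestion more concrete, here is a rough sketch of such a list, assuming one file title per line in a UTF-8 text file. `DownloadManifest`, `open`, `pending`, and `markDone` are hypothetical names chosen for illustration; nothing here reflects how Imker stores state today.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a persistent "already downloaded" list (Java 9+). */
final class DownloadManifest {
    private final Path listFile;
    private final Set<String> done = new LinkedHashSet<>();

    private DownloadManifest(Path listFile) {
        this.listFile = listFile;
    }

    /** Loads the list if the file exists; otherwise starts with an empty one. */
    static DownloadManifest open(Path listFile) {
        DownloadManifest manifest = new DownloadManifest(listFile);
        try {
            if (Files.exists(listFile)) {
                manifest.done.addAll(Files.readAllLines(listFile, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return manifest;
    }

    /** Returns the category members that are not yet recorded as downloaded. */
    List<String> pending(List<String> categoryMembers) {
        List<String> todo = new ArrayList<>();
        for (String name : categoryMembers) {
            if (!done.contains(name)) {
                todo.add(name);
            }
        }
        return todo;
    }

    /** Appends the name to the file right away, so an aborted run loses nothing. */
    void markDone(String name) {
        if (!done.add(name)) {
            return; // already recorded
        }
        try {
            Files.write(listFile, List.of(name), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both use cases map onto `pending`: after an aborted batch it yields the files still missing, and months later it yields only the files added to the category since the last run.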

nbehrnd commented 4 years ago

@nicolas-raoul Translating «Kategorie» as "category" and «Anzahl der Listen» as "number of lists to generate" is one thing. While unlikely to be exhaustive, the short list mentioned taught me the following substitution rules between the «safe for internet / pure ASCII (maybe even 7-bit?)» codes and the special characters uploaders may use in file names.

|-------------------------------+-----------------------------------------|
| code -> substitute (keyed as) | example                                 |
|-------------------------------+-----------------------------------------|
| %C3%A4 -> ä ("a)              | Kläranlage ([water] purification plant) |
| %C3%B6 -> ö ("o)              | öffentlich (public, adjective)          |
| %C3%BC -> ü ("u)              | Bürger (citizen)                        |
| %C3%9F -> ß ("s, or Alt + s)  | Kuß (kiss, noun)                        |
| %C3%AE -> î (^i)              | maître (master, noun)                   |
| %C3%A9 -> é ('e)              | école (school)                          |
|-------------------------------+-----------------------------------------|
| %C3%84 -> Ä ("A)              | Ärmelkanal (the English Channel)        |
| %C3%96 -> Ö ("O)              | Öffentlichkeit (public, noun)           |
| %C3%9C -> Ü ("U)              | Überraschung (surprise, noun)           |
|-------------------------------+-----------------------------------------|
| %2C -> ,                      | (comma)                                 |
| %21 -> !                      | (exclamation mark)                      |
| %27 -> '                      | (apostrophe)                            |
| %28 -> (                      | (opening parenthesis)                   |
| %29 -> )                      | (closing parenthesis)                   |
|-------------------------------+-----------------------------------------|

This gives a good reason to watch out for proper character encoding. The third group (punctuation) is the trickier one: I did not expect those characters to be permitted in file names.
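The codes in the table are UTF-8 percent-encoding, so the substitutions need not be applied by hand. As a minimal sketch (the class and method names `FileNameDecoding` and `decode` are made up for this example), the standard `java.net.URLDecoder` can reverse them; the `Charset` overload used here requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for turning percent-encoded titles back into text. */
final class FileNameDecoding {

    static String decode(String encoded) {
        // URLDecoder implements form decoding and would turn '+' into a space,
        // so literal plus signs are protected before decoding.
        String guarded = encoded.replace("+", "%2B");
        return URLDecoder.decode(guarded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Inputs built from rows of the table above.
        System.out.println(decode("Kl%C3%A4ranlage"));
        System.out.println(decode("%C3%9Cberraschung%21"));
        System.out.println(decode("ma%C3%AEtre_%28A%2CB%29"));
    }
}
```

On a UTF-8 console the three lines should read Kläranlage, Überraschung! and maître_(A,B). Decoding with any charset other than UTF-8 would split the two-byte sequences and produce the garbled umlauts mentioned earlier.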