openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

wikipedia_nl_all_maxi.zim is failing to create a new .zim #1992

Closed TheQuickFox closed 6 months ago

TheQuickFox commented 7 months ago

I got directed here from my Reddit post: https://www.reddit.com/r/Kiwix/comments/1apaen9/when_is_the_next_update_for_the_dutch_wikipedia/

The Dutch maxi .zim of Wikipedia has not been updated since July 2023. I wondered when a new version would be posted, and was told that one should already have been made, but that generation of the .zim is failing. When I dive into the logs, I see that it may be failing on some very long lists of birds.

Log: https://farm.openzim.org/recipes/wikipedia_nl_all

Mentioned lists in the fail log: https://nl.wikipedia.org/wiki/Lijst_van_vogels_in_IOC-volgorde https://nl.wikipedia.org/wiki/Lijst_van_vogels_alfabetisch (11001 birds)

Jaifroid commented 7 months ago

This affects the mini and nopic as well as the maxi (all done with the same recipe). The next run should be later this week (16th Feb).

kelson42 commented 7 months ago

There is an upstream API error, which should be reported in Phabricator. At the core of the problem is the size of such pages; a workaround would be to split them. Maybe asking kindly on the talk page would achieve this.

Popolechien commented 7 months ago

Sweet mother of mercy, each page is 1.6 million bytes o_0 Couldn't we (as another quick fix) blacklist/skip pages that are deemed too big?

kelson42 commented 7 months ago

Yes, and IMO this is a valid scenario for it. But IMHO, before doing that it would be healthier to (1) report the API bug to Phabricator and (2) try to split these two pages.
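As a side note, identifying which pages would trip such a size limit is straightforward, since the MediaWiki API exposes the byte size of a page's latest revision via `action=query&prop=revisions&rvprop=size`. A minimal sketch (the 1 MB threshold is an arbitrary assumption for illustration, not a limit mwoffliner actually enforces):

```python
from urllib.parse import urlencode

# Dutch Wikipedia API endpoint (the wiki discussed in this issue).
API_ENDPOINT = "https://nl.wikipedia.org/w/api.php"


def size_query_url(titles):
    """Build a MediaWiki API URL that returns the byte size of each
    page's latest revision (prop=revisions, rvprop=size)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "size",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def oversized_titles(api_response, limit=1_000_000):
    """Given the decoded JSON response, return the titles whose latest
    revision exceeds `limit` bytes (1 MB here, an arbitrary cutoff)."""
    pages = api_response["query"]["pages"]
    return [
        p["title"]
        for p in pages
        if p.get("revisions", [{}])[0].get("size", 0) > limit
    ]
```

Fetching `size_query_url(["Lijst_van_vogels_in_IOC-volgorde", "Lijst_van_vogels_alfabetisch"])` and feeding the JSON to `oversized_titles` would flag both bird lists, which the issue reports at roughly 1.6 million bytes each.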

Popolechien commented 7 months ago

Ok, I've pinged the pages' main author, and they very helpfully replied: "The logical thing to do is to fix that apparently buggy software that you use. These bird lists are large but well below the maximum allowed Wikipedia page size." ¯\\_(ツ)_/¯

Jaifroid commented 7 months ago

So, we have a new run tomorrow. If we'd like it to succeed, then I think the blacklist function is the only viable thing right now, unless you feel it's better not to issue a new ZIM without these pages.

In any case, @kelson42 I think you're probably the only person with both technical understanding of the issue and a Phabricator account to open an issue there and explain clearly what's needed.... (sorry!).

TheQuickFox commented 6 months ago

The task has run. Mini and nopic .zim files have been created, but wikipedia_nl_all_maxi.zim was cancelled and still dates from July 2023.

The log mentions that the task failed to save a few articles. These seem to be articles that were deleted between the time the scraper task started and when it finished:

[error] [2024-02-19T10:43:24.663Z] Error downloading article Robin_Bomans
[error] [2024-02-19T15:23:14.766Z] Error downloading article Daar_ga_je_Boem!
[error] [2024-02-20T07:13:44.997Z] Error downloading article Veur_Elkaar
[error] [2024-02-21T01:37:26.059Z] Error downloading article Robin_Bomans
[error] [2024-02-21T06:25:16.221Z] Error downloading article Daar_ga_je_Boem!

I don't know if this was the reason why the maxi .zim creation was cancelled.

Update: It looks like the .zim has successfully been made now: https://download.kiwix.org/zim/wikipedia/wikipedia_nl_all_maxi_2024-02.zim