openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org
mdwiki and mdwiki_app recipes are failing #997

Closed Jaifroid closed 4 months ago

Jaifroid commented 4 months ago

ZIM(s) location

Last good scrapes:

Recipe(s) URL

Details

After five months of good scrapes, both mdwiki recipes have failed simultaneously. @tim-moody?

tim-moody commented 4 months ago

Was waiting to hear from you as I noticed the failure. I have been working on what I hope is a more robust version of the cacher. I ran it and it succeeded. You can run it by changing these parameters:

--articleList="https://mdwiki.wmcloud.org/nonwiki/lists/mdwikimed.tsv" --mwUrl="https://mdwiki.wmcloud.org/"

I think we now have problem pages on EN WP, which time out. The new cacher returns 404 for them rather than 50x. I was planning to test for one more month, but since the old cacher now fails I think we should switch.

Jaifroid commented 4 months ago

Ha! I tend to check round about now (or earlier) each month in preparation for making a new release...

I don't have access to change parameters or initiate runs on zimfarm. Perhaps you do? If not, maybe @benoit74. Many thanks for being on top of it, @tim-moody.

tim-moody commented 4 months ago

@Jaifroid and @benoit74, I don't have the ability to change parameters or initiate runs either. I'd appreciate it if someone could make the changes and rerun. Alternatively, I have the zims, but I'd rather not step outside of the usual workflow.

Jaifroid commented 4 months ago

Yes, to build the app (which is done on GitHub actions), the ZIMs need to be available in mirror.download.kiwix.org, as they are pulled (more than once) into different workflows. So it's best to wait for a re-run, which hopefully can happen next week.🤞

benoit74 commented 4 months ago

In both cases, the page "COVID-19_pandemic_in_Switzerland" seems not to be in good shape; at least, the Wikipedia API is returning an HTTP 500 error on http://offline.mdwiki.org/w/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&page=COVID-19_pandemic_in_Switzerland.

Looks like this is linked to a recent edit of the page, which is unfortunately semi-protected.

I requested an edit: https://en.wikipedia.org/wiki/Talk:COVID-19_pandemic_in_Switzerland#Semi-protected_edit_request_on_13_May_2024
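
For anyone who wants to check other pages the same way, here is a minimal sketch of how the API URL above can be assembled for an arbitrary page title. The helper name `visualeditor_parse_url` is hypothetical; the query parameters are copied from the URL quoted in this comment, and nothing here is taken from the scraper's own code.

```python
from urllib.parse import urlencode


def visualeditor_parse_url(base_url: str, page: str) -> str:
    """Build the visualeditor 'parse' API URL for one page (hypothetical helper)."""
    # Parameter names and values mirror the URL quoted in the thread.
    query = urlencode(
        {
            "action": "visualeditor",
            "mobileformat": "html",
            "format": "json",
            "paction": "parse",
            "page": page,
        }
    )
    # Strip a trailing slash so the base can be given with or without one.
    return f"{base_url.rstrip('/')}/w/api.php?{query}"


if __name__ == "__main__":
    print(visualeditor_parse_url("http://offline.mdwiki.org/", "COVID-19_pandemic_in_Switzerland"))
```

Fetching the resulting URL and looking at the HTTP status is then enough to tell whether a given page would break the run.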

Jaifroid commented 4 months ago

OK, as I have the correct editing status, I was able to make that edit. I suspect it may take some time to carry over to mdwiki?

tim-moody commented 4 months ago

Your problem will be that COVID-19_pandemic_in_Switzerland was only the first of several COVID timeline pages that you hit, and it killed the run. There are at least 4 of them, plus a few other EN WP pages that can also kill it.

https://mdwiki.org/wiki/Timeline_of_the_COVID-19_pandemic_in_South_Africa
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Cuba

My experience is that this started with the WMF shift of data centers last month, though it was worst during the shift itself. At that time I was testing mdwiki.wmcloud.org and it failed repeatedly, as did http://offline.mdwiki.org/.

Both of these are proxies for mdwiki.org, but the new one now proxies the EN WP pages as well and returns 404 for any 50x error page, so that the run does not break.
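
The status handling described above can be sketched as follows. This is only an illustration of the stated behaviour (any 50x from upstream is reported as 404), not the actual cacher code; the function name `status_for_scraper` is hypothetical.

```python
def status_for_scraper(upstream_status: int) -> int:
    """Return the status the proxy would report for a given upstream status.

    Assumption: only 5xx responses are rewritten; everything else passes through.
    """
    if 500 <= upstream_status <= 599:
        # A server error would abort the run, so report the page as missing instead.
        return 404
    return upstream_status
```

With this mapping a broken page is treated like a missing article rather than a fatal error.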

I suspect it may take some time to carry over to mdwiki?

EN WP pages are read directly by the proxy, but they are cached, so there could be a lag for a page that does not fail. A failed page is not cached, so it would be read again immediately.
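
A minimal sketch of that caching rule, assuming a fetch function that returns a `(status, body)` pair. The class `SuccessOnlyCache` and its interface are hypothetical, and cache expiry (the source of the lag mentioned above) is left out.

```python
from typing import Callable, Dict, Tuple

# Hypothetical fetcher signature: page title in, (HTTP status, body) out.
Fetcher = Callable[[str], Tuple[int, str]]


class SuccessOnlyCache:
    """Cache pages that were fetched successfully; never cache failures."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._pages: Dict[str, str] = {}

    def get(self, title: str) -> Tuple[int, str]:
        if title in self._pages:
            # Cached copy: may lag behind the live page until it is refreshed.
            return 200, self._pages[title]
        status, body = self._fetch(title)
        if status == 200:
            self._pages[title] = body
        # Failures are returned but not stored, so the next request hits upstream again.
        return status, body
```

Under this rule, a page that was failing is picked up on the next request after it has been fixed upstream.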

FYI http://offline.mdwiki.org/ is deprecated and I hope to phase it out in July in favor of mdwiki.wmcloud.org. The latter has had two good runs so far, last month and this.

You may also be interested in https://mdwiki.wmcloud.org/nonwiki/status

Jaifroid commented 4 months ago

Hmm, so despite the edit http://offline.mdwiki.org/w/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&page=COVID-19_pandemic_in_Switzerland is still returning 500. I edited out the commented block in the stated revision (deleted the block), but it seems the issue could be something else.

@tim-moody What is the next step? We could blacklist the not-working articles for now I suppose. But looking ahead, is it a straight swap to use mdwiki.wmcloud.org, or would it require development work? I'm sorry -- I only work on the JS app(s), so don't really have experience with the backend...

Jaifroid commented 4 months ago

Another alternative might be (if this is OK with @benoit74) for us to upload manually the ZIMs from your successful run to download.kiwix.org (I have access to that). Though they might not end up being visible to library.kiwix.org... Not sure about protocol on this, @benoit74.

tim-moody commented 4 months ago

What is the next step?

I recommend moving to the new cacher, see above for the changes to the recipe.

tim-moody commented 4 months ago

https://mdwiki.wmcloud.org/w/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&page=COVID-19_pandemic_in_Switzerland

benoit74 commented 4 months ago

Thank you very much @tim-moody

I did not understand your first comment correctly, sorry for that. And I did not realize at first who you are either, sorry again ^^

FYI http://offline.mdwiki.org/ is deprecated and I hope to phase it out in July in favor of mdwiki.wmcloud.org. The latter has had two good runs so far, last month and this

Then the way forward is quite obvious!

Since @RavanJAltaie does not work on Mondays, I've moved the configurations from offline.mdwiki.org to mdwiki.wmcloud.org and requested the recipes again with high priority, so that they start soon (the zimfarm queue has been quite long for the last few days).

tim-moody commented 4 months ago

and

--articleList="https://mdwiki.wmcloud.org/nonwiki/lists/mdwikimed.tsv" ?

Jaifroid commented 4 months ago

Thank you very much, both. Fingers crossed!

benoit74 commented 4 months ago

and --articleList="https://mdwiki.wmcloud.org/nonwiki/lists/mdwikimed.tsv" ?

Yep, I just didn't mention that we were still using http and not https, so the task failed once more ... :(

Both are now progressing correctly; hopefully they will soon be finished and OK.
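
To avoid the http/https slip when moving a recipe, the URL rewrite can be sketched as below. The helper `migrate_url` is hypothetical and is not part of zimfarm; it only assumes that URLs on either host should end up as https on mdwiki.wmcloud.org.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "offline.mdwiki.org"
NEW_HOST = "mdwiki.wmcloud.org"


def migrate_url(url: str) -> str:
    """Point a recipe URL at the new cacher over https (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.hostname not in (OLD_HOST, NEW_HOST):
        # Leave unrelated URLs untouched.
        return url
    # Force https and the new host; keep path, query and fragment as they are.
    return urlunsplit(("https", NEW_HOST, parts.path, parts.query, parts.fragment))
```

For example, `migrate_url("http://offline.mdwiki.org/")` gives `https://mdwiki.wmcloud.org/`, the `--mwUrl` value suggested earlier in the thread.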

Jaifroid commented 4 months ago

The good news is that both files completed (I'm downloading now). The possibly not-so-good news is the mdwiki_en_all_app_maxi_2024-05.zim file (the one intended for use in apps), which should be around 1.5GB, is listed as 2GB. This archive should be smaller than mdwiki_en_all_maxi_2024-05.zim (standard), as it is supposed not to have a full-text index. For some strange reason it appears to be bigger unless the listing size is wrong. EDIT: Listing is wrong, see below.

Jaifroid commented 4 months ago

OK, forget that: the listing on download.kiwix.org is wrong. It says 2GB, and the archive is in fact 1.51GB (after downloading). I've seen this before, so it's a (non-serious) bug with the download library software... All looks good here. I'll close this issue after testing.

kelson42 commented 4 months ago

@Jaifroid Please open a dedicated issue; so far I don't understand what you mean.

Jaifroid commented 4 months ago

@kelson42 I would, but I don't know where to open a bug for master.download.kiwix.org. See circled entry in screenshot for what I mean. After downloading it, the actual file size is 1.51GB (bottom screenshot).

[screenshot: download listing with the circled entry]

[screenshot: actual size of the downloaded file, 1.51GB]

Jaifroid commented 4 months ago

Come to think of it, it may just be rounding? Perhaps it has tipped from 1.49GB to 1.51GB. Probably not a bug.
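
A quick illustration of the rounding guess, assuming the listing rounds sizes to the nearest whole GB. How the download listing actually formats sizes is not known here, so `listed_size_gb` is purely a hypothetical stand-in.

```python
def listed_size_gb(size_gb: float) -> int:
    """Round a size in GB to the nearest whole number, halves rounding up.

    Assumption: the listing shows whole units only.
    """
    return int(size_gb + 0.5)


if __name__ == "__main__":
    for size in (1.49, 1.51):
        print(size, "->", listed_size_gb(size))
```

Under that assumption a file growing from 1.49GB to 1.51GB would jump from "1" to "2" in the listing.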

benoit74 commented 4 months ago

@Jaifroid you can open an issue on https://github.com/kiwix/k8s if needed

I'm closing this ticket since the problem is solved.

Jaifroid commented 4 months ago

Great, thanks for your help solving this so quickly @benoit74 and @tim-moody. Regarding the file size reporting, I don't think I need to open a bug report, as it's just aggressive rounding, and ultimately it's not a value that normal users will ever see.