openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

wikipedia_arz_all (and a few other WP) is failing with `Cannot render [] into an article` #2003

Open benoit74 opened 8 months ago

benoit74 commented 8 months ago

Recipe: https://farm.openzim.org/recipes/wikipedia_arz_all

Error is:

[error] [2024-03-02T00:03:51.082Z] Error downloading article كريس_ستاندرينج
[error] [2024-03-02T00:03:51.086Z] Failed to run mwoffliner after [7351s]: {
    "name": "Error",
    "message": "Cannot render [] into an article"
}
[error] [2024-03-02T00:03:51.086Z] 

**********

Cannot render [] into an article

**********

Looks like the article is not empty online: https://arz.wikipedia.org/wiki/%D9%83%D8%B1%D9%8A%D8%B3_%D8%B3%D8%AA%D8%A7%D9%86%D8%AF%D8%B1%D9%8A%D9%86%D8%AC

benoit74 commented 7 months ago

wikipedia_id_all is impacted as well: https://github.com/openzim/zim-requests/issues/879

kelson42 commented 4 months ago

@audiodude This now prevents many Wikipedias from rendering properly. I believe this is not a regression in 1.14, but it seriously impairs our ability to move forward with testing 1.14. The last run of WPAR is impacted: https://farm.openzim.org/pipeline/c8708ce9-f831-4c06-a9d6-748e6e860cec/debug

kelson42 commented 4 months ago

WPCA is impacted as well: https://farm.openzim.org/pipeline/1c29259f-d858-40f4-8cfb-530696e2b20f/debug

audiodude commented 4 months ago

Although the error message is the same, I'm not sure this is the same bug.

For WPARZ, I cannot reproduce with an articleList of only كريس_ستاندرينج.

For WPCA, it is 100% reproducible with an articleList of Khalifa_ibn_Askar. However it is also the case that https://ca.wikipedia.org/api/rest_v1/page/mobile-html/Khalifa_ibn_Askar returns empty/missing data: https://gist.github.com/audiodude/139ad898a925733d56fd08fee5a5fb9f

WPID doesn't reproduce the bug when using an article list of IL-2_Sturmovik_(series). However it fails otherwise with the following stack trace: https://gist.github.com/audiodude/7743f8e6020c4dbe9c4f32301c7e5a6e
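
For the WPCA case, one way to tell "upstream returned nothing" apart from a rendering bug is to check the payload before it reaches the renderer. This is only a minimal sketch: `isEmptyPayload` is a hypothetical helper, not an mwoffliner function, and the set of shapes it treats as empty is an assumption based on the `[]` in the error message.

```typescript
// Hypothetical helper, not part of mwoffliner: reports whether a fetched
// payload carries no renderable content.
function isEmptyPayload(payload: unknown): boolean {
  if (payload === null || payload === undefined) return true;
  // Whitespace-only bodies count as empty.
  if (typeof payload === "string") return payload.trim().length === 0;
  // An empty array matches the `[]` seen in the error message.
  if (Array.isArray(payload)) return payload.length === 0;
  // An object with no own keys has nothing to render either.
  if (typeof payload === "object") return Object.keys(payload as object).length === 0;
  return false;
}
```

Logging the article title whenever this returns true would at least show whether the failing runs all hit an empty upstream response.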

audiodude commented 4 months ago

Finally, realizing that WPAR is different from WPARZ, I tried the former and could not reproduce with an articleList of توموت

kelson42 commented 4 months ago

Hmmm, not sure what should be done next. In your log the line:

[warn] [2024-06-30T18:30:00.292Z] Couldn't find strings file for [id]

Seems suspicious.

audiodude commented 4 months ago

> Seems suspicious.

That's the new message added in #2050. Before, it would simply fail to find the id file, since there's no translation file for that language, and fall back silently to en. Now it logs a message whenever it can't find a required file.
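
Roughly, the behaviour described here looks like the sketch below. It is not the code from #2050: `loadStrings`, its parameters, and the exact warning wording are hypothetical, and an in-memory map stands in for the translation files on disk.

```typescript
// Hypothetical sketch of "fall back to en, but log when a strings file is missing".
type Strings = Record<string, string>;

function loadStrings(
  lang: string,
  available: Record<string, Strings>, // stands in for the translation files
  warn: (msg: string) => void,
  fallbackLang = "en",
): Strings {
  const fallback = available[fallbackLang] ?? {};
  const requested = available[lang];
  if (requested === undefined) {
    // Previously silent; now the missing file is reported before falling back.
    warn(`Couldn't find strings file for language [${lang}], falling back to [${fallbackLang}]`);
    return { ...fallback };
  }
  // Keys missing from the requested language are taken from the fallback.
  return { ...fallback, ...requested };
}
```

With this shape, a warning for `id` only means that the English strings are used; it does not by itself make the run fail.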

kelson42 commented 3 months ago

I get it, somehow this message is missing the keyword "language"...

audiodude commented 3 months ago

Overall, though, this issue is currently not reproducible and seems to be caused by some kind of upstream bug. Perhaps we should update the code to be more resilient to that. It's not clear what kind of Phabricator ticket we could file other than "JSON endpoint sometimes returns empty response for non-empty articles", and we have no demonstrable reproduction case to attach to it.
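
As one possible shape for "more resilient", the sketch below skips articles whose upstream data comes back empty and only aborts if too many are affected. Everything in it is an assumption for illustration: `renderAll`, `fetchSections`, `RunReport`, and the `maxSkippedRatio` threshold are hypothetical names and not existing mwoffliner code or options.

```typescript
// Hypothetical sketch: tolerate a bounded number of empty articles instead of
// aborting the whole run on the first one.
interface RunReport {
  rendered: string[];
  skipped: string[];
}

function renderAll(
  articleIds: string[],
  fetchSections: (id: string) => unknown[], // stands in for the upstream call
  maxSkippedRatio = 0.01,
): RunReport {
  const report: RunReport = { rendered: [], skipped: [] };
  for (const id of articleIds) {
    const sections = fetchSections(id);
    if (sections.length === 0) {
      // Upstream returned nothing for this article: record it and move on.
      report.skipped.push(id);
      continue;
    }
    report.rendered.push(id);
  }
  // Still fail loudly if too many articles came back empty.
  if (articleIds.length > 0 && report.skipped.length / articleIds.length > maxSkippedRatio) {
    throw new Error(`Too many empty articles: ${report.skipped.length}/${articleIds.length}`);
  }
  return report;
}
```

The `skipped` list would also give us concrete titles and timestamps to attach to an upstream ticket if the empty responses show up again.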