openzim / wikihow

WikiHow scraper
https://download.kiwix.org/zim/wikihow/
GNU General Public License v3.0
wikihow does not retry API requests #161

Open benoit74 opened 4 months ago

benoit74 commented 4 months ago

The last wikihow_en_endless task failed early, while listing the articles in each category.

The error returned is a 503, which is probably a transient error.

Task: https://farm.openzim.org/pipeline/bc83a4cb-341e-43f6-b1e4-e17b2324b5f0/debug

Logs:

[MainThread::2024-02-22 22:07:34,231] DEBUG:-> article: Diagnose-Auditory-Processing-Disorder
[MainThread::2024-02-22 22:07:34,231] DEBUG:-> article: Cover-Your-Ear-in-the-Shower
[MainThread::2024-02-22 22:07:50,346] ERROR:Interrupting process due to error: Call failed: {"status_code": 503, "text_body": ""}
[MainThread::2024-02-22 22:07:50,346] ERROR:Call failed: {"status_code": 503, "text_body": ""}

We should probably retry API calls not only on ConnectionError (the current pywikiapi behavior) but also on errors that look transient. Alternatively, we could retry on all errors except 404 and a few others, since identifying a transient error with certainty may be too complex.
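
A minimal sketch of the second option (retry everything except a short deny-list of status codes), to make the idea concrete. This is not pywikiapi code: `ApiCallError`, `call_with_retry` and `NON_RETRYABLE_STATUS` are hypothetical names, and the exact set of non-retryable codes is an assumption to be adjusted.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: status codes treated as permanent failures (never retried).
NON_RETRYABLE_STATUS = frozenset({400, 401, 403, 404})


class ApiCallError(Exception):
    """Hypothetical stand-in for the error raised when an API call fails."""

    def __init__(self, status_code: int, text_body: str = "") -> None:
        super().__init__(f"Call failed: status_code={status_code}")
        self.status_code = status_code
        self.text_body = text_body


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError and on any ApiCallError
    whose status code is not in NON_RETRYABLE_STATUS."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ApiCallError as exc:
            # Give up on permanent errors, or once attempts are exhausted.
            if exc.status_code in NON_RETRYABLE_STATUS or attempt == attempts:
                raise
        except ConnectionError:
            if attempt == attempts:
                raise
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With this shape, a 503 like the one in the logs above would be retried up to `attempts` times, while a 404 would fail immediately. The `sleep` parameter is only there so the backoff can be tested without waiting.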

We might also consider adding this retry logic to web scraping calls (they are not retried either).

rgaudin commented 4 months ago

You might want to check https://github.com/openzim/wikihow/issues?q=is%3Aissue+503

benoit74 commented 4 months ago

My bad. So only API requests are not retried, I think. Thank you!

benoit74 commented 1 month ago

This continues to impact MANY recipes (e.g. the last runs of wikihow_ru_maxi, wikihow_pt_maxi, wikihow_nl_maxi).

Note that this issue might be made irrelevant (more or less) if we decide to switch from pywikiapi to another library as suggested in https://github.com/openzim/wikihow/issues/162