Closed kelson42 closed 9 months ago
To me it looks like the remote server configuration has changed, and now we are over the limit.
Should we contact them and ask for the limit so we can configure appropriately?
I'm not sure...
Any argument against?
In the past (prior to scraper development) they said they know they have protections that might affect us. Now that this system has changed, they probably have a better understanding/control of it.
They are probably more worried about feeding chatbots than our scrapers.
@Popolechien what do you think? Maybe take a look at that conversation (2y ago?)
Looking into the logs and the code, I would say that:
I think it will be hard to solve this issue without help from wikihow, either to explain how to avoid throttling (because the fact that we try many times might incur even more throttling) or to whitelist our IPs / give us more allowance.
I will nevertheless dig into more logs to check whether the server returns more information when we are throttled; it could be that the server asks us to wait for a specific period of time and we do not respect this.
I believe we have separate logic for article retrieval (standard HTML scraping) and API calls because:
What I found in the last wikihow ES run is that we are failing at the extraction of sub-categories; we are not even at the point of processing articles.
I reproduced the issue on my machine by being even more aggressive (not sure it was a good idea, anyway ...), and this is what I found:
- API calls go through the `pywikiapi` library, which is indeed a MediaWiki-specific library that does not expose the request
- retries in `pywikiapi` are only for `requests.exceptions.ConnectionError` (as returned by `requests`) and `maxlag` errors (return code of the API, meant to deal with edits through the API on a replicated cluster where there might be some replication lag); this is not our situation

Response body:
{
"code": "ratelimited",
"info": "You've exceeded your rate limit. Please wait some time and try again.",
"docref": "See https://es.wikihow.com/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."
}
Response headers:
{
"Connection": "keep-alive",
"Content-Type": "application/json; charset=utf-8",
"X-Robots-Tag": "noindex, nofollow",
"MediaWiki-API-Error": "ratelimited",
"X-Frame-Options": "DENY",
"Content-Disposition": "inline; filename=api-result.json",
"Cache-Control": "private, must-revalidate, max-age=0",
"Content-Encoding": "gzip",
"Accept-Ranges": "bytes",
"Date": "Mon, 21 Aug 2023 12:21:54 GMT",
"X-Timer": "S1692620514.967166,VS0,VE122",
"x-c": "cache-lin2290026-LIN,M cache-iad-kjyo7100064-IAD,m",
"x-content-type-options": "nosniff",
"x-xss-protection": "1; mode=block",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
"Vary": "Accept-Encoding",
"alt-svc": "h3=\":443\";ma=86400,h3-29=\":443\";ma=86400,h3-27=\":443\";ma=86400",
"transfer-encoding": "chunked"
}
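To make the "does the server ask us to wait?" check concrete, here is a minimal sketch, not the scraper's actual code. It assumes the response body has the flat shape pasted above and that any wait hint would arrive in the standard HTTP `Retry-After` header (which is absent from the headers captured above). The function names `is_rate_limited` and `suggested_wait` are hypothetical.

```python
from typing import Mapping


def _lower_keys(headers: Mapping[str, str]) -> dict:
    """HTTP header names are case-insensitive, so normalise them once."""
    return {name.lower(): value for name, value in headers.items()}


def is_rate_limited(headers: Mapping[str, str], body: Mapping[str, object]) -> bool:
    """True if the MediaWiki-API-Error header or the JSON body says 'ratelimited'."""
    header_code = _lower_keys(headers).get("mediawiki-api-error")
    return header_code == "ratelimited" or body.get("code") == "ratelimited"


def suggested_wait(headers: Mapping[str, str], fallback: float) -> float:
    """Seconds to wait before the next API call.

    Uses a numeric Retry-After value if the server sent one; otherwise
    returns `fallback`. The HTTP-date form of Retry-After is not handled
    in this sketch and also falls back.
    """
    value = _lower_keys(headers).get("retry-after")
    if value is None:
        return fallback
    try:
        return max(float(value), 0.0)
    except ValueError:
        return fallback
```

With the headers captured above, `is_rate_limited` would report the throttling, but `suggested_wait` would return the fallback, since no `Retry-After` header is present.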
Suggestions are more than welcome; I think we need more information/help from Wikihow to progress on this issue.
We are contacting Wikihow to get insight. In the meantime, all Zimfarm Wikihow recipes have been disabled.
I suspected that throttling might also be induced by the fact that we had many recipes running at the same time, so I relaunched only wikihow_cs, but I was plainly wrong: delving into the data, it looks like that did not happen recently. Looking at the most recent tasks of every schedule:

| Schedule name | Task ID | status | started | scraper_completed |
|---|---|---|---|---|
| wikihow_en_endless_home-and-garden | 75473fd8-0a40-41fc-abfe-073849369b8c | succeeded | 2023-06-06T09:35:24 | 2023-06-07T08:32:08 |
| wikihow_en_endless | 74da5719-ba19-4856-baac-7a956417bcc4 | failed | 2023-06-07T08:38:59 | 2023-06-07T09:00:20 |
| wikihow_en_maxi | 33eb5b2c-7435-40ad-932c-a214ae8268f0 | failed | 2023-06-19T17:01:42 | 2023-06-19T17:22:44 |
| wikihow_ar_maxi | 4b707428-da0d-48bf-8094-91f743ec5f77 | succeeded | 2023-07-01T11:02:11 | 2023-07-02T03:17:04 |
| wikihow_en_endless_work-world | e3016a78-204d-4a4c-9a4b-71a950907549 | failed | 2023-07-02T03:23:39 | 2023-07-02T03:44:54 |
| wikihow_de_maxi | 552477cf-d5a6-4d96-bb3b-a35b35acb2b1 | succeeded | 2023-07-02T13:05:20 | 2023-07-04T05:02:05 |
| wikihow_en_endless_personal-care-and-style | 8cf6f0ed-9446-47fd-b462-776b1c106049 | failed | 2023-07-04T07:00:46 | 2023-07-04T07:21:40 |
| wikihow_en_endless_computers-and-electronics | 40e77b89-e920-40a9-9a61-fe060a02a250 | failed | 2023-07-04T22:03:33 | 2023-07-04T22:24:22 |
| wikihow_en_endless_sports-and-fitness | 9299500a-cb45-4c34-a87f-ab148065baa4 | failed | 2023-07-08T19:00:50 | 2023-07-08T19:21:39 |
| wikihow_en_endless_youth | 2087cb2d-1738-4765-900a-db75c6351aa3 | failed | 2023-07-09T09:01:02 | 2023-07-09T09:21:54 |
| wikihow_es_maxi | 40aa4413-12ef-4533-86a3-e0a378fce62d | failed | 2023-07-10T12:00:50 | 2023-07-10T12:21:49 |
| wikihow_fa_maxi | 7700f172-7c94-4204-af26-9cbd140b634c | succeeded | 2023-07-15T05:01:26 | 2023-07-15T05:32:36 |
| wikihow_en_endless_arts-and-entertainment | 8d24f101-5ac8-4ef4-b90f-99d414cac452 | failed | 2023-07-17T17:01:48 | 2023-07-17T17:23:10 |
| wikihow_en_endless_cars-and-other-vehicles | 6767c513-a6dc-43df-b15b-0edd0b864620 | failed | 2023-07-22T00:03:05 | 2023-07-22T00:24:08 |
| wikihow_en_endless_education-and-communications | 33b9aca4-2495-4d37-9666-4ef32a9cf2fd | failed | 2023-07-22T07:00:51 | 2023-07-22T07:21:42 |
| wikihow_en_endless_holidays-and-traditions | 78ea8770-d826-47aa-b662-b4d0009e6199 | failed | 2023-07-24T12:01:44 | 2023-07-24T12:30:03 |
| wikihow_en_endless_food-and-entertaining | d439240d-e294-4f87-a618-f3d345e50205 | failed | 2023-07-24T18:00:59 | 2023-07-24T18:21:52 |
| wikihow_en_endless_hobbies-and-crafts | 9dc28007-3baa-4294-b916-0cd7eab5bb00 | failed | 2023-07-25T21:00:58 | 2023-07-25T21:21:46 |
| wikihow_fr_maxi | 45f38296-6907-472d-b76a-19f9508a1fa1 | failed | 2023-07-31T11:01:08 | 2023-07-31T11:22:13 |
| wikihow_cs_maxi | dd0ddd59-f72b-410f-ac7c-70ca140ef692 | failed | 2023-09-21T05:05:11 | 2023-09-21T05:21:27 |
wikihow_cs_maxi dd0ddd59-f72b-410f-ac7c-70ca140ef692 (running with 2 secs delays, the "normal" setting) failed for a weird reason: it failed to download the illustration (while the URL works from my machine).
I launched the recipe wikihow_es_maxi on my own machine with 4 secs delays but I got rate limited again. I will relaunch it in the zimfarm with 8 secs delays (only API delay for now, this is where the rate limiting occurs looking at some tasks).
https://farm.openzim.org/pipeline/d12b352d-0498-47f1-904f-977562caaf0a is running with this 8 secs delay at the API level and "usual" 2 secs delay at the web crawler.
WikiHow uses the Platform feature, which sets conditions on concurrent tasks:
PLATFORM_wikihow_MAX_TASKS_PER_WORKER: "1"
PLATFORM_wikihow_MAX_TASKS_TOTAL: "1"
Task for wikihow_es is progressing normally, so:
- setting `api_delay` (for API calls) to 8 secs (instead of 2) allowed the crawler to succeed
- the `delay` setting (for web crawling calls) is kept at 2 secs and this is ok

I've updated all other tasks with this same 8 secs setting and re-enabled all of them. This 8 secs delay is huge but not very significant in terms of task duration because most of it is spent in the web crawling part (even with the 8 secs delay).
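For readers unfamiliar with what two separate delays mean in practice, here is a minimal sketch of a per-channel throttle. It is not the scraper's implementation: the `Throttle` class and its parameters are hypothetical. It only illustrates enforcing one minimum interval for API calls (8 secs) and another for web crawling calls (2 secs).

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls on one channel."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Block until `min_interval` has elapsed since the previous call.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# One throttle per kind of request, mirroring the two settings discussed above.
api_throttle = Throttle(8.0)  # API calls
web_throttle = Throttle(2.0)  # web crawling calls
```

Calling `api_throttle.wait()` before each API request and `web_throttle.wait()` before each page download keeps the two rates independent, so slowing the API calls does not slow the much more numerous crawling calls.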
Any remark or may I close the ticket?
They never got back to me, so I guess we can close the ticket, yes.
See https://farm.openzim.org/recipes?category=wikihow
All scrapes die the same way: "ratelimited", "info": "You've exceeded your rate limit. Please wait some time and try again."