openzim / wikihow

WikiHow scraper
https://download.kiwix.org/zim/wikihow/
GNU General Public License v3.0

WikiHow does not scrape anymore, too many http requests #150

Closed kelson42 closed 9 months ago

kelson42 commented 10 months ago

See https://farm.openzim.org/recipes?category=wikihow

All scrapes die the same way: "code": "ratelimited", "info": "You've exceeded your rate limit. Please wait some time and try again."

kelson42 commented 10 months ago

To me it looks like the remote server configuration has changed, and now we are over the limit.

rgaudin commented 10 months ago

Should we contact them and ask for the limit so we can configure appropriately?

kelson42 commented 10 months ago

> Should we contact them and ask for the limit so we can configure appropriately?

I'm not sure...

rgaudin commented 10 months ago

Any argument against?

In the past (prior to scraper development), they said they knew they had protections that might affect us. Now that this system has changed, they probably have a better understanding/control of it.

They are probably more worried about feeding chatbots than our scrapers.

@Popolechien what do you think? Maybe take a look at that conversation (2y ago?)

benoit74 commented 10 months ago

Looking into the logs and the code, I would say that:

I think it will be hard to solve this issue without help from wikihow, either to explain how to avoid throttling (because retrying many times might incur even more throttling) or to whitelist our IPs / give us more allowance.

I will nevertheless dig a bit more into the logs to check whether the server returns more information when we are throttled. It could be that the server asks us to wait for a specific period of time and we do not respect this.
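As an illustration of what "respecting" a server-provided wait time could look like, here is a minimal sketch. It assumes the server might send a standard HTTP `Retry-After` header (nothing in this thread confirms that wikiHow does) and falls back to capped exponential backoff otherwise. The function name `wait_seconds` and its `base`/`cap` defaults are hypothetical and not taken from the scraper.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def wait_seconds(headers, attempt, base=2.0, cap=300.0, now=None):
    """Return how long to pause before retry number `attempt` (0-based).

    Hypothetical helper: honours a Retry-After header if one is present,
    otherwise uses capped exponential backoff.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():
            # Delta-seconds form, e.g. "Retry-After: 120".
            return float(value)
        try:
            # HTTP-date form, e.g. "Retry-After: Mon, 21 Aug 2023 12:30:00 GMT".
            target = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            target = None
        if target is not None:
            if target.tzinfo is None:
                target = target.replace(tzinfo=timezone.utc)
            current = now or datetime.now(timezone.utc)
            return max(0.0, (target - current).total_seconds())
    # No usable hint from the server: back off exponentially, up to `cap`.
    return min(cap, base * (2 ** attempt))
```

With no hint from the server, successive attempts would wait 2, 4, 8, ... seconds, which avoids hammering an endpoint that is already throttling us.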

rgaudin commented 10 months ago

I believe we have separate logic for article retrieval (standard HTML scraping) and API calls because:

benoit74 commented 10 months ago

What I found in the last wikihow ES run is that we are failing at the extraction of sub-categories; we are not even at the point of processing articles.

I reproduced the issue on my machine by being even more aggressive (not sure it was a good idea, anyway ...), and this is what I found:

Response body:

{
  "code": "ratelimited",
  "info": "You've exceeded your rate limit. Please wait some time and try again.",
  "docref": "See https://es.wikihow.com/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."
}

Response headers:

{
  "Connection": "keep-alive",
  "Content-Type": "application/json; charset=utf-8",
  "X-Robots-Tag": "noindex, nofollow",
  "MediaWiki-API-Error": "ratelimited",
  "X-Frame-Options": "DENY",
  "Content-Disposition": "inline; filename=api-result.json",
  "Cache-Control": "private, must-revalidate, max-age=0",
  "Content-Encoding": "gzip",
  "Accept-Ranges": "bytes",
  "Date": "Mon, 21 Aug 2023 12:21:54 GMT",
  "X-Timer": "S1692620514.967166,VS0,VE122",
  "x-c": "cache-lin2290026-LIN,M cache-iad-kjyo7100064-IAD,m",
  "x-content-type-options": "nosniff",
  "x-xss-protection": "1; mode=block",
  "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
  "Vary": "Accept-Encoding",
  "alt-svc": "h3=\":443\";ma=86400,h3-29=\":443\";ma=86400,h3-27=\":443\";ma=86400",
  "transfer-encoding": "chunked"
}

Suggestions more than welcome, I consider we need more information/help from Wikihow to progress on this issue.
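For reference, the response above can be recognised programmatically from either the `MediaWiki-API-Error` header or the `code` field of the body. Note that the headers shown carry no `Retry-After`, so the server gives no explicit wait time here. The sketch below is only an illustration: the function name `is_rate_limited` is hypothetical, and accepting the fields nested under an `error` key (as MediaWiki API responses commonly do) is an assumption, since the body pasted above shows them at the top level.

```python
import json


def is_rate_limited(headers, body_text):
    """Return True if a response looks like the 'ratelimited' error above.

    Hypothetical helper; `headers` is a plain dict and `body_text` the
    decoded response body.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    if lowered.get("mediawiki-api-error") == "ratelimited":
        return True
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Not JSON at all (e.g. an HTML error page).
        return False
    if not isinstance(payload, dict):
        return False
    # Assumption: accept the error object at the top level (as pasted above)
    # or nested under an "error" key.
    error = payload.get("error", payload)
    return isinstance(error, dict) and error.get("code") == "ratelimited"
```

Checking the header first means the body does not need to be parsed in the common case.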

benoit74 commented 10 months ago

We are contacting Wikihow to get insight. In the meantime, all Zimfarm Wikihow recipes have been disabled.

benoit74 commented 9 months ago

I suspected that throttling might also be caused by having many recipes running at the same time, so I relaunched only wikihow_cs. But I was plain wrong: delving into the data, it looks like that has not happened recently. Here are the most recent tasks of every schedule:

| Schedule name | Task ID | status | started | scraper_completed |
| --- | --- | --- | --- | --- |
| wikihow_en_endless_home-and-garden | 75473fd8-0a40-41fc-abfe-073849369b8c | succeeded | 2023-06-06T09:35:24 | 2023-06-07T08:32:08 |
| wikihow_en_endless | 74da5719-ba19-4856-baac-7a956417bcc4 | failed | 2023-06-07T08:38:59 | 2023-06-07T09:00:20 |
| wikihow_en_maxi | 33eb5b2c-7435-40ad-932c-a214ae8268f0 | failed | 2023-06-19T17:01:42 | 2023-06-19T17:22:44 |
| wikihow_ar_maxi | 4b707428-da0d-48bf-8094-91f743ec5f77 | succeeded | 2023-07-01T11:02:11 | 2023-07-02T03:17:04 |
| wikihow_en_endless_work-world | e3016a78-204d-4a4c-9a4b-71a950907549 | failed | 2023-07-02T03:23:39 | 2023-07-02T03:44:54 |
| wikihow_de_maxi | 552477cf-d5a6-4d96-bb3b-a35b35acb2b1 | succeeded | 2023-07-02T13:05:20 | 2023-07-04T05:02:05 |
| wikihow_en_endless_personal-care-and-style | 8cf6f0ed-9446-47fd-b462-776b1c106049 | failed | 2023-07-04T07:00:46 | 2023-07-04T07:21:40 |
| wikihow_en_endless_computers-and-electronics | 40e77b89-e920-40a9-9a61-fe060a02a250 | failed | 2023-07-04T22:03:33 | 2023-07-04T22:24:22 |
| wikihow_en_endless_sports-and-fitness | 9299500a-cb45-4c34-a87f-ab148065baa4 | failed | 2023-07-08T19:00:50 | 2023-07-08T19:21:39 |
| wikihow_en_endless_youth | 2087cb2d-1738-4765-900a-db75c6351aa3 | failed | 2023-07-09T09:01:02 | 2023-07-09T09:21:54 |
| wikihow_es_maxi | 40aa4413-12ef-4533-86a3-e0a378fce62d | failed | 2023-07-10T12:00:50 | 2023-07-10T12:21:49 |
| wikihow_fa_maxi | 7700f172-7c94-4204-af26-9cbd140b634c | succeeded | 2023-07-15T05:01:26 | 2023-07-15T05:32:36 |
| wikihow_en_endless_arts-and-entertainment | 8d24f101-5ac8-4ef4-b90f-99d414cac452 | failed | 2023-07-17T17:01:48 | 2023-07-17T17:23:10 |
| wikihow_en_endless_cars-and-other-vehicles | 6767c513-a6dc-43df-b15b-0edd0b864620 | failed | 2023-07-22T00:03:05 | 2023-07-22T00:24:08 |
| wikihow_en_endless_education-and-communications | 33b9aca4-2495-4d37-9666-4ef32a9cf2fd | failed | 2023-07-22T07:00:51 | 2023-07-22T07:21:42 |
| wikihow_en_endless_holidays-and-traditions | 78ea8770-d826-47aa-b662-b4d0009e6199 | failed | 2023-07-24T12:01:44 | 2023-07-24T12:30:03 |
| wikihow_en_endless_food-and-entertaining | d439240d-e294-4f87-a618-f3d345e50205 | failed | 2023-07-24T18:00:59 | 2023-07-24T18:21:52 |
| wikihow_en_endless_hobbies-and-crafts | 9dc28007-3baa-4294-b916-0cd7eab5bb00 | failed | 2023-07-25T21:00:58 | 2023-07-25T21:21:46 |
| wikihow_fr_maxi | 45f38296-6907-472d-b76a-19f9508a1fa1 | failed | 2023-07-31T11:01:08 | 2023-07-31T11:22:13 |
| wikihow_cs_maxi | dd0ddd59-f72b-410f-ac7c-70ca140ef692 | failed | 2023-09-21T05:05:11 | 2023-09-21T05:21:27 |

wikihow_cs_maxi dd0ddd59-f72b-410f-ac7c-70ca140ef692 (running with 2 secs delays, the "normal" setting) failed for a weird reason: it failed to download the illustration (while the URL works from my machine).

I launched the recipe wikihow_es_maxi on my own machine with 4 secs delays, but I got rate limited again. I will relaunch it in the zimfarm with 8 secs delays (only the API delay for now; judging from some tasks, this is where the rate limiting occurs).

benoit74 commented 9 months ago

https://farm.openzim.org/pipeline/d12b352d-0498-47f1-904f-977562caaf0a is running with this 8 secs delay at the API level and "usual" 2 secs delay at the web crawler.
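To make the "two different delays" idea concrete, here is a minimal sketch of independent throttles for API calls and web crawling. It is not the scraper's actual implementation: the class name `MinIntervalThrottle` and the variables `api_throttle` / `web_throttle` are hypothetical, and only the 8 secs and 2 secs values come from this thread.

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum interval between consecutive calls to wait().

    Hypothetical helper; `clock` and `sleep` are injectable so the
    behaviour can be checked without real waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the rest.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Values from this thread: 8 secs between API calls, 2 secs between page fetches.
api_throttle = MinIntervalThrottle(8.0)
web_throttle = MinIntervalThrottle(2.0)
```

Calling `api_throttle.wait()` before each API request and `web_throttle.wait()` before each page fetch keeps the two rates independent, so slowing down the API does not slow down the crawl.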

rgaudin commented 9 months ago

WikiHow uses the Platform feature, which sets conditions on concurrent tasks:

PLATFORM_wikihow_MAX_TASKS_PER_WORKER: "1"
PLATFORM_wikihow_MAX_TASKS_TOTAL: "1"

benoit74 commented 9 months ago

Task for wikihow_es is progressing normally, so:

I've updated all other tasks with this same 8 secs setting and re-enabled all of them. This 8 secs delay is huge but not very significant in terms of task duration, because most of the time is spent in the web crawling part (even with the 8 secs delay).

Any remarks, or may I close the ticket?

Popolechien commented 9 months ago

They never got back to me, so I guess we can close the ticket, yes.