openzim / wikihow

WikiHow scraper
https://download.kiwix.org/zim/wikihow/
GNU General Public License v3.0

Full wikihow triggers block #121

Closed by kelson42 2 years ago

kelson42 commented 2 years ago

https://farm.openzim.org/pipeline/2e20106943b8f39a67434326

kelson42 commented 2 years ago

Similar scenario for WikiHow PT: it just dies at the very end. https://farm.openzim.org/pipeline/e52110095e40a4ec5c222326/debug

rgaudin commented 2 years ago

Nothing like “just dying at the very end”. What made you think that?

We're hitting wikihow's traffic protection mechanism: when it receives too many requests, the server stops accepting connections, so we get timeouts.

The code that ran PT (I changed it between ES and PT) already has a large number of retries (10), with a long, increasing sleep time in between (30s * attempt-nb). This means that for this error to happen, the scraper would have been retrying for more than 5 minutes.
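For illustration, the retry policy described above (up to 10 tries, sleeping 30s * attempt-nb in between) could be sketched roughly as follows. This is not the scraper's actual code: the function names, the injectable `sleep` parameter, and the reading of "10 retries" as 10 total attempts with no sleep after the last one are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def total_backoff_seconds(attempts: int = 10, base_delay: float = 30.0) -> float:
    """Total time spent sleeping if every attempt fails.

    Assumes no sleep after the final attempt (hypothetical policy detail).
    """
    return sum(base_delay * n for n in range(1, attempts))


def retry_with_linear_backoff(
    func: Callable[[], T],
    attempts: int = 10,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, sleeping base_delay * attempt_nb between tries."""
    for attempt_nb in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt_nb == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * attempt_nb)  # 30s, 60s, 90s, ...
    raise ValueError("attempts must be at least 1")
```

Under these assumptions, `total_backoff_seconds(10, 30.0)` sums 30 s × (1 + 2 + … + 9), so a run that exhausts every attempt has spent well over 5 minutes sleeping alone.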

As we don't know how the server is configured, it's hard to guess what we should be doing exactly.

I'd advise we un-request all wikihows and switch one to a 2s interval to see what happens. What do you think?
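As a rough sketch of what a fixed 2s interval between requests could look like on the client side: the `Throttle` class below is hypothetical and is not how the scraper or the farm actually applies its delay setting. The clock and sleep functions are injectable only to make the behaviour easy to check.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous request was released

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds actually slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraper loop would call `throttle.wait()` right before each HTTP request, so that requests to the server are spaced at least `min_interval` seconds apart regardless of how fast each page is processed.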

kelson42 commented 2 years ago

@rgaudin Agree

kelson42 commented 2 years ago

WikiHow ES passes with 2/2 delays