Open zach-sb opened 6 months ago
@zach-sb we've investigated: the website is served by CloudFront (the AWS CDN) and is probably protected by some AWS solution (equivalent to CloudFront). Do you happen to know the website owners? We need them to add our IPs to their whitelist.
It has been decided that zim-requests must contain only new requests, so let's transfer this to the mwoffliner repo.
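For anyone who wants to reproduce the CloudFront check, here is a rough sketch, not the script we actually ran. It assumes the usual CloudFront response markers (`X-Amz-Cf-Id`, `X-Amz-Cf-Pop`, or "cloudfront" in `Via`, `X-Cache` or `Server`); the function names `looks_like_cloudfront` and `fetch_status_and_headers` are made up for this example.

```python
import urllib.error
import urllib.request


def looks_like_cloudfront(headers):
    """Return True if the response headers carry typical CloudFront markers."""
    # Header names are case-insensitive, so normalise before checking.
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    return (
        "x-amz-cf-id" in h
        or "x-amz-cf-pop" in h
        or "cloudfront" in h.get("via", "")
        or "cloudfront" in h.get("x-cache", "")
        or "cloudfront" in h.get("server", "")
    )


def fetch_status_and_headers(url, timeout=10):
    """Send a HEAD request and return (status_code, headers), even on 4xx/5xx."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, dict(resp.headers.items())
    except urllib.error.HTTPError as err:
        # A 403 raises HTTPError, but the status and headers are still available.
        return err.code, dict(err.headers.items())
```

Calling `fetch_status_and_headers` on a page URL and passing the returned headers to `looks_like_cloudfront` shows whether a blocked response came through CloudFront. It does not tell you which protection rule triggered the block.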
sorry, this is a zimit thing.
I've updated the recipe to use the Zimit2 test image to see whether this improves the situation. I'm moving this back to zim-requests since this is not a new scraper problem.
@benoit74 Shouldn't the zim file be retired from the library, or is the zimit2 release so close that it'll overwrite it soon?
It is publishing in dev for now. Anyway, the task has failed again, blocked right away by the website.
Zimit2 and Crawler 1.x won't help; we are still blocked by the website. The only solution will be to obtain whitelisting from the website owner. Any point of contact?
When using the TV Tropes tvtropes_en_all_2023-09.zim that I downloaded today, many pages result in what looks like a TV Tropes 403 error.
For example, starting on the home page:

![image](https://github.com/openzim/zim-requests/assets/30288956/aa8fc22f-26b1-473a-a75f-c5c1b33ca8d4)
Click on "God"; the first link from there is "The Alpha and the Omega", which results in a 403. The same thing happens if you start with "great power", then "Powers that Be", or with anything else I tried on that page.
This happens both with a kiwix-server docker setup and on the library.kiwix.org website.