tvtropes (zimit) cannot be scraped anymore

openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!

https://farm.openzim.org


tvtropes (zimit) cannot be scraped anymore #1007

Open zach-sb opened 6 months ago

zach-sb commented 6 months ago

When using the TV Tropes tvtropes_en_all_2023-09.zim that I downloaded today, many pages result in what looks like a TV Tropes 403 error.

For example, starting on the home page:

Click on "God", then the first link from there, "The Alpha and the Omega", which results in a 403. The same thing happens if you start with "great power", then "Powers that Be", or anything else on that page that I tried.

This happens with a kiwix-serve Docker setup and on the library.kiwix.org website.

RavanJAltaie commented 5 months ago

@zach-sb We've investigated: the website is served by CloudFront (AWS CDN) and is probably protected by some AWS solution (equivalent to CloudFront). Do you happen to know the website owners? We need to get our IPs added to their whitelist.
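For anyone wanting to check a blocked response themselves, here is a minimal sketch of the kind of header inspection that points to CloudFront. The function name and the sample header values are made up for illustration and were not captured from tvtropes.org; the header names (`Via`, `X-Cache`, `X-Amz-Cf-Id`) are ones CloudFront commonly adds, but a given site may strip or change them.

```python
def looks_like_cloudfront_block(status, headers):
    """Return True if a response looks like a 403 served via CloudFront.

    `status` is the HTTP status code and `headers` a plain dict of
    response headers. This is a heuristic, not a definitive check.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): v for k, v in headers.items()}
    via_cloudfront = (
        "x-amz-cf-id" in lowered
        or "cloudfront" in lowered.get("via", "").lower()
        or "cloudfront" in lowered.get("x-cache", "").lower()
    )
    return status == 403 and via_cloudfront


# Hypothetical header values, for illustration only.
sample = {
    "Via": "1.1 example.cloudfront.net (CloudFront)",
    "X-Cache": "Error from cloudfront",
}
print(looks_like_cloudfront_block(403, sample))
```

The function only looks at values it is given, so it can be fed headers copied from a browser's network tab or from any HTTP client.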

benoit74 commented 5 months ago

It has been decided that zim-requests must contain only new requests, so let's transfer this to the mwoffliner repo.

benoit74 commented 5 months ago

Sorry, this is a zimit thing.

benoit74 commented 2 months ago

I've updated the recipe to use the Zimit2 test image to see whether this improves the situation. I'm moving this back to zim-requests since this is not a new scraper problem.

Popolechien commented 2 months ago

@benoit74 Shouldn't the zim file be retired from the library, or is the zimit2 release so close that it'll overwrite it soon?

benoit74 commented 2 months ago

It is publishing in dev for now. Anyway, the task has failed again, blocked right away by the website.

benoit74 commented 1 month ago

Zimit2 and Crawler 1.x won't help; we are still blocked by the website. The only solution will be to obtain whitelisting from the website owner. Any point of contact?