openzim / zimit

Make a ZIM file from any Web site and surf offline!

Can't Zim a Wiki #255

Closed pfspace closed 3 months ago

pfspace commented 9 months ago

Hi there,

I'm developing a simple 2D platformer game and, due to its poor performance on low-end hardware, I decided to take a break from the engine I was building my game on and build my own engine from scratch in C and SDL.

Then I learned about a community game called Sonic Robo Blast 2, which is based on the Doom Legacy engine, and decided to learn more about this project - which is really impressive stuff, by the way - and keep an offline copy of their wiki for studying and reference.

I tried to zim their wiki from youzim.it, but the site fails and tells me to open an issue. Here I am. Here is the wiki link: wiki.srb2.org

Thank you in advance.

benoit74 commented 9 months ago

We will have a look, sorry about the inconvenience and thank you very much for your interest and support

pfspace commented 9 months ago

There is nothing to apologize for. Thank you very much for your attention and congratulations on your great work on ZimIt.

benoit74 commented 9 months ago

I confirm this is a scraper issue. This is not the first occurrence, so I've opened https://github.com/openzim/zimit/issues/256 to track and solve this issue. I will keep you informed here as well once we've made progress.

pfspace commented 9 months ago

Thank you very much.

benoit74 commented 7 months ago

In fact, the problem is that the wiki is protected by Cloudflare, and Cloudflare considers us a bot scraping the website (which is not that wrong). Unless you know the site admin and can ask them to whitelist our IP, there is probably not much we can do.

pfspace commented 7 months ago

I don't know them. I'm not a member of the community, just learned of this project recently. Anyway, I understand the situation. Thank you very much for your efforts and attention.

Popolechien commented 7 months ago

@pfspace Drop them an email explaining the issue? We're actually looking for someone to work with to develop a proper whitelisting procedure, and people who start a wiki are usually collaborative-minded.

pfspace commented 7 months ago

Sure, I can try.

Logan-A commented 7 months ago

Hello, I am one of the people that run the wiki at https://wiki.srb2.org/

I am looking into our logs

alama commented 7 months ago

Sorry, but we have blocked AS12876 due to forum spam going to https://mb.srb2.org/ coming from that datacenter and I am not going to remove this blockage.

benoit74 commented 7 months ago

@alama @Logan-A

We have various workers, donated by various volunteers across various machines all around the globe (most of these machines are not ours), so it is true that unblocking the whole AS12876 is really not appropriate (we do not control the whole AS at all, and we do not even have full control over the machine in some cases) and not sufficient (next time we might run the task on a different worker, probably on a different AS).

I would prefer that we test (if possible for you, of course) the whitelisting of one single worker IP for now, on a machine we have full control over (so that I can guarantee you won't get other traffic from this IP), and I will ensure the next job is run on this machine.

Is it correct that this is a configuration you do in Cloudflare? How do you do it, in the WAF? It is not that important for this specific test, but we would like to learn what is possible with the various WAF / protection systems on the market (at least the main ones like Cloudflare) so that we have clear procedures for what to do in future cases.

Thank you anyway for your cooperation on this, much appreciated!

benoit74 commented 7 months ago

PS: is it an issue if the IP I give you is in AS12876? This is both a technical question (give a higher priority to the ALLOW rule or something like that) and a non-technical one (is it OK for you). AS12876 is used by a French cloud provider (Scaleway) from which we rent a machine, but for sure you have very varied traffic / stuff running on their 475,136 IPv4 addresses (not speaking about IPv6 ...).
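To make the "higher priority to the ALLOW rule" idea concrete, here is a minimal sketch of first-match rule evaluation, assuming rules are checked in order and the first match wins. This is not Cloudflare's actual rule engine or API; the `RULES` list and the `decide` function are hypothetical, and the addresses come from the documentation range 203.0.113.0/24 rather than from AS12876 or any real worker.

```python
import ipaddress

# Hypothetical rule list, evaluated top to bottom (first match wins).
# The addresses are documentation-range placeholders, not real worker
# IPs and not real AS12876 prefixes.
RULES = [
    ("allow", ipaddress.ip_network("203.0.113.7/32")),  # one trusted worker IP
    ("block", ipaddress.ip_network("203.0.113.0/24")),  # stand-in for the blocked AS
]


def decide(ip, rules=RULES, default="allow"):
    """Return the action of the first rule whose network contains ip."""
    addr = ipaddress.ip_address(ip)
    for action, network in rules:
        if addr in network:
            return action
    return default


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "203.0.113.8", "198.51.100.1"):
        print(candidate, decide(candidate))
```

With this ordering, the single worker IP is allowed while the rest of the range stays blocked; if the two rules were swapped, the broader block would match first and the allow rule would never apply.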

Logan-A commented 7 months ago

I have unblocked AS12876, and am now trying to zim wiki.srb2.org via youzim.it

pfspace commented 7 months ago

> I have unblocked AS12876, and am now trying to zim wiki.srb2.org via youzim.it

Thank you very much for your help and attention.

benoit74 commented 7 months ago

@Logan-A Your youzim.it task has been successful: https://farm.youzim.it/pipeline/7f652142-39fe-4228-9678-8550c726c44d

It took a bit of time because there were quite a lot of jobs in the pipe when you requested the job (the pipe is now empty ATM).

Unfortunately, the ZIM is not complete because the job was throttled after 2 hours of scraping. Only ~1400 pages have been scraped out of ~18000 pages discovered by the scraper so far.

You might want to apply for a zim-request (open an issue at https://github.com/openzim/zim-requests) so that we create the ZIM on our regular workers, which have no time or size limit (but the IP and AS will probably change), and we will also update the ZIM regularly. We have some policies around which content we consider for inclusion in our set of ZIMs, but I think you might qualify since we have already made a ZIM of a Pokémon wiki.

pfspace commented 7 months ago

Thank you all for your support, efforts and attention.

benoit74 commented 3 months ago

Nothing left to do on the scraper side, closing this.