openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: Atheist Republic #761

Open SecularSekai opened 6 months ago

SecularSekai commented 6 months ago

Please use the following format for a ZIM creation request (and delete unnecessary information)

Popolechien commented 6 months ago

@SecularSekai Hey can you please drop a line at hello@kiwix.org for the permission?

RavanJAltaie commented 6 months ago

@Popolechien please let us know here once permission is sent so we can start creating the recipe. @SecularSekai

SecularSekai commented 6 months ago

@Popolechien @RavanJAltaie Sorry for the wait! I notified the copyright holder, who has informed me that she sent an email providing permission to hello@kiwix.org. If you need any additional information, please let me know. Thanks, guys!

Popolechien commented 6 months ago

@RavanJAltaie We're good to go - fingers crossed.

RavanJAltaie commented 6 months ago

https://farm.openzim.org/recipes/atheistrepublic_en_all Recipe created, will update the library link here once ready

SecularSekai commented 6 months ago

@RavanJAltaie Fantastic! How can I access it and view the ZIM locally on Kiwix Desktop?

SecularSekai commented 6 months ago

@RavanJAltaie @Popolechien Hi, guys! It looks like the recipe failed when I check the link. What would be the next step to troubleshoot it?

RavanJAltaie commented 4 months ago

@benoit74 the recipe is failing with error:

  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 437, in zimit
    raise subprocess.CalledProcessError(crawl.returncode, cmd_args)
subprocess.CalledProcessError: Command '['crawl', '--failOnFailedSeed', '--waitUntil', 'load', '--title', 'Atheist Republic', '--description', 'We are not just atheists, we are atheists who care.', '--depth', '-1', '--timeout', '90', '--scopeType', 'domain', '--lang', 'eng', '--behaviors', 'autoplay,autofetch,siteSpecific', '--behaviorTimeout', '90', '--diskUtilization', '90', '--url', 'https://www.atheistrepublic.com/', '--userAgent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15 +Zimit contact+zimfarm@kiwix.org', '--cwd', '/output/.tmpmypky3kv', '--statsFilename', '/output/crawl.json']' returned non-zero exit status 9.

Any ideas?
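For readers unfamiliar with the traceback above: `subprocess.CalledProcessError` is the exception Python uses to report a child process that exited with a non-zero status. The sketch below reproduces that pattern with a stand-in child process. `run_crawl` is a hypothetical helper written for illustration, not zimit's actual code, and the stand-in command simply exits with status 9 rather than running a real crawl.

```python
import subprocess
import sys


def run_crawl(cmd_args):
    """Run a command; raise CalledProcessError if it exits non-zero.

    Hypothetical helper mirroring the `raise` line in the traceback.
    """
    crawl = subprocess.run(cmd_args)
    if crawl.returncode:
        raise subprocess.CalledProcessError(crawl.returncode, cmd_args)
    return crawl.returncode


# Stand-in for the real crawler: a child Python process that exits with 9.
fake_crawl = [sys.executable, "-c", "import sys; sys.exit(9)"]

try:
    run_crawl(fake_crawl)
except subprocess.CalledProcessError as exc:
    failed_status = exc.returncode
    print(f"crawl failed with exit status {failed_status}")
```

The exception only carries the exit status and the command line, so the actual cause has to be found in the crawler's own logs.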

benoit74 commented 4 months ago

It looks like, after some time, the scraper was banned due to too many requests.

However, digging a bit into the logs, it is clear that we spend a lot of time (more than 90%, I would say) trying to archive the store items, which is probably not intentional. I don't think these need to be in the ZIM.

I suggest that we exclude all https://www.atheistrepublic.com/store URLs.
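One way to picture this exclusion is as a regular expression tested against each candidate URL. The sketch below is only an illustration: the pattern `STORE_EXCLUDE` and the helper `is_excluded` are assumptions made for this example, and the item paths are hypothetical. The expression actually entered in the recipe is not shown in this thread.

```python
import re

# Assumed pattern: matches the store root and anything beneath it,
# but not unrelated paths that merely start with the letters "store".
STORE_EXCLUDE = re.compile(
    r"^https?://www\.atheistrepublic\.com/store(/|\?|$)"
)


def is_excluded(url):
    """Return True if the URL falls under the store section."""
    return STORE_EXCLUDE.search(url) is not None


candidates = [
    "https://www.atheistrepublic.com/store",
    "https://www.atheistrepublic.com/store/some-item",   # hypothetical path
    "https://www.atheistrepublic.com/forums/",
    "https://www.atheistrepublic.com/storefront-news",   # hypothetical path
]

# URLs that would still be crawled under this assumed pattern.
kept = [url for url in candidates if not is_excluded(url)]
```

Anchoring the pattern on `/store` followed by `/`, `?`, or end of string keeps it from accidentally dropping pages whose path merely begins with the same letters.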

I also noticed there is a forum which is hosted both on https://www.atheistrepublic.com/forums/ (old forum, if I get it right) and https://forum.atheistrepublic.com/ (new forum). The new forum seems to be based on the Discourse platform, which will probably be difficult to scrape (we might be blocked quite soon).

@SecularSekai do you want the forums to be archived in the ZIM as well or is it not needed or not mandatory?

RavanJAltaie commented 2 months ago

@benoit74 what do you think we should do? Shall we tag this as upstream or reject the issue?

SecularSekai commented 2 months ago

@benoit74 Hi! Sorry I didn't see your comment back in February until now. The forums are less important, so they can be omitted from the ZIM if need be.

We really appreciate all the help with this!

RavanJAltaie commented 2 months ago

@benoit74 shall we try to create it without the forums?

benoit74 commented 2 months ago

Next steps are:

Do we all agree that we do not want store content inside the ZIM?

RavanJAltaie commented 2 months ago

@benoit74 I agree on all points. @SecularSekai do you agree?

SecularSekai commented 2 months ago

@RavanJAltaie @benoit74 I agree and appreciate the thorough thought on this. Let's move forward with that strategy and see where we get. Loss of the forums is not a major problem.

benoit74 commented 2 months ago

@RavanJAltaie do you need help to remove the store items?

RavanJAltaie commented 1 month ago

@benoit74 yes please, let me know how it should be done and I'll do it.

benoit74 commented 1 month ago

I've updated the recipe exclude criteria:

image

This will exclude store items for now. The forum should be archived since it is on the same domain and scopeType is set to domain. Let's see if it manages to properly archive the forum as well. I've requested the recipe.
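To illustrate why the forum remains in scope, here is a minimal sketch of a domain-style scope check. It assumes that a "domain" scope means "the seed's domain with any leading `www.` dropped, plus all of its subdomains". That reading and the helper `in_domain_scope` are assumptions for this example, not the crawler's actual implementation.

```python
from urllib.parse import urlparse

# Seed URL taken from the failing command earlier in this thread.
SEED = "https://www.atheistrepublic.com/"


def in_domain_scope(url, seed=SEED):
    """Assumed 'domain' scope: same domain as the seed, or a subdomain."""
    seed_host = urlparse(seed).hostname or ""
    # Drop a leading "www." so sibling subdomains are compared to the base.
    base = seed_host[4:] if seed_host.startswith("www.") else seed_host
    host = urlparse(url).hostname or ""
    return host == base or host.endswith("." + base)


old_forum_in_scope = in_domain_scope("https://www.atheistrepublic.com/forums/")
new_forum_in_scope = in_domain_scope("https://forum.atheistrepublic.com/")
other_site_in_scope = in_domain_scope("https://example.org/")
```

Under this assumption, both forum locations would be followed by the crawl unless an exclude rule removes them, while unrelated hosts stay out.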