Open SecularSekai opened 6 months ago
@SecularSekai Hey can you please drop a line at hello@kiwix.org for the permission?
@Popolechien please let us know here once permission is sent so we can start creating the recipe. @SecularSekai
@Popolechien @RavanJAltaie Sorry for the wait! I notified the copyright holder, who has informed me that she sent an email providing permission to hello@kiwix.org. If you need any additional information, please let me know. Thanks, guys!
@RavanJAltaie We're good to go - fingers crossed.
https://farm.openzim.org/recipes/atheistrepublic_en_all Recipe created, will update the library link here once ready
@RavanJAltaie Fantastic! How can I access it and view the ZIM locally on Kiwix Desktop?
@RavanJAltaie @Popolechien Hi, guys! It looks like the recipe failed when I checked the link. What would be the next step to troubleshoot it?
@benoit74 the recipe is failing with error:
File "/usr/bin/zimit", line 566, in <module>
    zimit()
File "/usr/bin/zimit", line 437, in zimit
    raise subprocess.CalledProcessError(crawl.returncode, cmd_args)
subprocess.CalledProcessError: Command '['crawl', '--failOnFailedSeed', '--waitUntil', 'load', '--title', 'Atheist Republic', '--description', 'We are not just atheists, we are atheists who care.', '--depth', '-1', '--timeout', '90', '--scopeType', 'domain', '--lang', 'eng', '--behaviors', 'autoplay,autofetch,siteSpecific', '--behaviorTimeout', '90', '--diskUtilization', '90', '--url', 'https://www.atheistrepublic.com/', '--userAgent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15 +Zimit contact+zimfarm@kiwix.org', '--cwd', '/output/.tmpmypky3kv', '--statsFilename', '/output/crawl.json']' returned non-zero exit status 9.
Any ideas?
It looks like the scraper was banned after some time due to too many requests.
However, digging a bit into the logs, it is clear that we spend a lot of time (more than 90%, I would say) trying to archive the store items, which is probably not intentional. I don't think the store needs to be in the ZIM.
I suggest that we exclude all https://www.atheistrepublic.com/store URLs.
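To make the suggestion concrete, here is a minimal Python sketch for checking a candidate exclusion regex against a few URLs before putting it in a recipe. The pattern, the helper name `is_excluded`, and the sample paths are assumptions for illustration only; they are not the value actually configured in the recipe.

```python
import re

# Hypothetical pattern: matches /store itself and anything below it on the main site.
STORE_EXCLUDE = re.compile(r"^https?://www\.atheistrepublic\.com/store(/|\?|$)")


def is_excluded(url: str) -> bool:
    """Return True if the URL matches the (assumed) store exclusion pattern."""
    return STORE_EXCLUDE.search(url) is not None


# Sample URLs are made up; only the /store prefix comes from this thread.
samples = [
    "https://www.atheistrepublic.com/store",
    "https://www.atheistrepublic.com/store/some-item",
    "https://www.atheistrepublic.com/forums/",
    "https://www.atheistrepublic.com/storefront-news",
]

for url in samples:
    print(url, "excluded" if is_excluded(url) else "kept")
```

Anchoring the pattern on `/store` followed by `/`, `?`, or the end of the URL is meant to avoid accidentally dropping unrelated pages whose path merely starts with the same letters.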
I also noticed there is a forum which is hosted both on https://www.atheistrepublic.com/forums/ (old forum, if I get it right) and https://forum.atheistrepublic.com/ (new forum). The new forum seems to be based on the Discourse platform, which will probably be difficult to scrape (we might be blocked quite soon).
@SecularSekai do you want the forums to be archived in the ZIM as well or is it not needed or not mandatory?
@benoit74 what do you think we should do? Shall we tag this as upstream or reject the issue?
@benoit74 Hi! Sorry I didn't see your comment back in February until now. The forums are less important, so they can be omitted from the ZIM if need be.
We really appreciate all the help with this!
@benoit74 shall we try to create it without the forums?
Next steps are:
Do we all agree that we do not want store content inside the ZIM?
@benoit74 I agree on all points. @SecularSekai do you agree?
@RavanJAltaie @benoit74 I agree and appreciate the thorough thought on this. Let's move forward with that strategy and see where we get. Loss of the forums is not a major problem.
@RavanJAltaie do you need help to remove the store items?
@benoit74 yes please, let me know how it should be done and I'll do it.
I've updated the recipe exclude criteria:
This will exclude store items for now. The forum should be archived since it is on the same domain and scopeType is set to domain. Let's see if the crawl manages to properly archive the forum as well. I've requested the recipe.
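For anyone wondering which URLs a domain-wide scope would pick up, here is a rough Python approximation. It assumes that "domain" scope means the seed's host (ignoring a leading `www.`) plus any of its subdomains; the crawler's real rule may differ, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit

SEED = "https://www.atheistrepublic.com/"


def base_host(url: str) -> str:
    """Host of the URL with a leading 'www.' stripped (assumed scope base)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def in_domain_scope(url: str, seed: str = SEED) -> bool:
    """Approximate check: same base host as the seed, or a subdomain of it."""
    base = base_host(seed)
    host = urlsplit(url).hostname or ""
    return host == base or host.endswith("." + base)


for url in (
    "https://www.atheistrepublic.com/forums/",
    "https://forum.atheistrepublic.com/",
    "https://example.org/",
):
    print(url, in_domain_scope(url))
```

Under that assumption, both the old forum path and the new forum subdomain would count as in scope, so the new forum may need its own exclusion if it turns out to get the crawler blocked.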