openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: forums.gentoo.org #1057

Open vitaly-zdanevich opened 1 week ago

vitaly-zdanevich commented 1 week ago
RavanJAltaie commented 5 days ago

Recipe created: https://farm.openzim.org/recipes/forums.gentoo.org_en_all. I'll update the library link once it is ready.

AngryLoki commented 3 days ago

@RavanJAltaie, why do I see the following at https://farm.openzim.org/pipeline/e0b8e527-0cab-4514-880b-9434ea0a32b2/debug:

{"workerid":0,"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199"}}
{"timestamp":"2024-06-25T15:49:46.110Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":49103,"total":960486,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-06-25T15:49:46.109Z\",\"extraHops\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"added\":\"2024-06-23T23:18:07.166Z\",\"depth\":3}"]}}
{"timestamp":"2024-06-25T15:49:46.309Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199","workerid":0}}

...

{"timestamp":"2024-06-25T15:50:34.199Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802","workerid":0}}

/login.php and /posting.php are disallowed by https://forums.gentoo.org/robots.txt
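
The robots.txt point can be checked locally. Below is a minimal sketch using Python's standard urllib.robotparser. The ROBOTS_TXT excerpt is an assumption reconstructed only from the two paths named above, not a copy of the live file, and the viewtopic.php URL is a hypothetical example of a page those two rules do not cover.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: excerpt rebuilt from the two disallowed paths mentioned in this
# thread; the real https://forums.gentoo.org/robots.txt may contain more rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login.php
Disallow: /posting.php
"""

def build_parser(robots_text):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# URLs taken from the crawler log above, plus one hypothetical topic page.
urls = [
    "https://forums.gentoo.org/posting.php?mode=reply&t=193199",
    "https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802",
    "https://forums.gentoo.org/viewtopic.php?t=193199",  # hypothetical example
]

for url in urls:
    verdict = "allowed" if parser.can_fetch("*", url) else "disallowed"
    print(verdict, url)
```

Under these assumed rules, both URLs from the log are reported as disallowed, which is what one would expect a robots.txt-respecting crawler to skip.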

RavanJAltaie commented 2 days ago

@benoit74 could you please check the above?

benoit74 commented 2 days ago

I'm not sure what you want me to check.

What I can state is that:

Does it answer your request?

benoit74 commented 2 days ago

FYI, I've opened https://github.com/webrecorder/browsertrix-crawler/issues/631 to discuss the second point with webrecorder folks.

benoit74 commented 2 days ago

Note: I've cancelled the task and disabled the recipe for now. The configuration is wrong: more than 1M pages have been found, so there is no point in continuing to crawl with the current configuration. Let's discuss this this afternoon.
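
For anyone who wants to watch the queue size on a future run, the counts can be read from a "Crawl statistics" log line like the one quoted earlier in this thread. A minimal sketch, assuming the log stays one JSON object per line with the same field names; LOG_LINE is an abridged copy of that quoted line (pendingPages omitted), and crawl_progress is a hypothetical helper name.

```python
import json

# Abridged copy of the "Crawl statistics" line quoted earlier in this thread
# (the pendingPages field is left out for brevity).
LOG_LINE = (
    '{"timestamp":"2024-06-25T15:49:46.110Z","logLevel":"info",'
    '"context":"crawlStatus","message":"Crawl statistics",'
    '"details":{"crawled":49103,"total":960486,"pending":1,"failed":0,'
    '"limit":{"max":0,"hit":false}}}'
)

def crawl_progress(line):
    """Return (crawled, total, fraction) for a crawlStatus line, else None."""
    entry = json.loads(line)
    if entry.get("context") != "crawlStatus":
        return None
    details = entry["details"]
    crawled = details["crawled"]
    total = details["total"]
    # Guard against an empty queue to avoid dividing by zero.
    fraction = crawled / total if total else 0.0
    return crawled, total, fraction

crawled, total, fraction = crawl_progress(LOG_LINE)
print(f"{crawled} of {total} pages crawled ({fraction:.1%})")
```

A "total" that keeps growing much faster than "crawled" is an early hint that the crawl scope is too wide.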