vitaly-zdanevich opened 1 week ago
Recipe created: https://farm.openzim.org/recipes/forums.gentoo.org_en_all. I'll update the library link once it's ready.
@RavanJAltaie, why do I see the following on https://farm.openzim.org/pipeline/e0b8e527-0cab-4514-880b-9434ea0a32b2/debug:
{"workerid":0,"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199"}}
{"timestamp":"2024-06-25T15:49:46.110Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":49103,"total":960486,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-06-25T15:49:46.109Z\",\"extraHops\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"added\":\"2024-06-23T23:18:07.166Z\",\"depth\":3}"]}}
{"timestamp":"2024-06-25T15:49:46.309Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199","workerid":0}}
...
{"timestamp":"2024-06-25T15:50:34.199Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802"],"workerid":0}}
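As an aside, the pendingPages entries in the crawl-statistics line are themselves JSON-encoded strings, so they need a second decoding pass. A minimal sketch (Python stdlib, fields abridged from the log line above):

```python
import json

# One "Crawl statistics" log line, abridged to the relevant fields.
line = r'{"details":{"pending":1,"pendingPages":["{\"seedId\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"depth\":3}"]}}'

details = json.loads(line)["details"]
# Each pendingPages entry is a JSON string, hence the second json.loads.
pages = [json.loads(p) for p in details["pendingPages"]]
print(pages[0]["url"], pages[0]["depth"])
```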
/login.php and /posting.php are disallowed by https://forums.gentoo.org/robots.txt
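That disallow behaviour can be checked locally. A minimal sketch using Python's urllib.robotparser, with the rules inlined as an assumed excerpt of the live robots.txt (not fetched here):

```python
from urllib.robotparser import RobotFileParser

# Assumed excerpt of https://forums.gentoo.org/robots.txt (not fetched here).
robots_txt = """\
User-agent: *
Disallow: /login.php
Disallow: /posting.php
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for path in ("/posting.php?mode=reply&t=193199",
             "/login.php?redirect=posting.php",
             "/viewtopic-t-193199.html"):
    url = "https://forums.gentoo.org" + path
    # Disallow rules are prefix matches on the URL path.
    print(path, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")
```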
@benoit74 could you please check the above?
I'm not sure what you want me to check.
What I can state is that:
- robots.txt is requesting to ignore these pages, but you have no exclude parameter set
- robots.txt Disallow rules are automatically respected (at least by default)
Does that answer your request?
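For context, an exclude parameter is typically a regex matched against candidate URLs (as with browsertrix-crawler's --exclude option). A quick sketch checking that hypothetical patterns would catch the offending pages:

```python
import re

# Candidate exclude regexes (hypothetical; matched against the full URL,
# in the style of browsertrix-crawler's --exclude option).
excludes = [re.compile(p) for p in (r"/posting\.php", r"/login\.php")]

urls = [
    "https://forums.gentoo.org/posting.php?mode=reply&t=193199",
    "https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802",
    "https://forums.gentoo.org/viewtopic-t-193199.html",
]

for u in urls:
    skipped = any(p.search(u) for p in excludes)
    print(u, "-> skip" if skipped else "-> crawl")
```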
FYI, I've opened https://github.com/webrecorder/browsertrix-crawler/issues/631 to discuss the second point with webrecorder folks.
Note: I've cancelled the task and disabled the recipe for now. The configuration is wrong and more than 1M pages have been found, so there is no point in continuing the crawl as-is. Let's discuss this afternoon.