vitaly-zdanevich opened 1 week ago
Recipe created: https://farm.openzim.org/recipes/forums.gentoo.org_en_all. I'll update the library link once it's ready.
@RavanJAltaie, why do I see the following on https://farm.openzim.org/pipeline/e0b8e527-0cab-4514-880b-9434ea0a32b2/debug:
{"workerid":0,"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199"}}
{"timestamp":"2024-06-25T15:49:46.110Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":49103,"total":960486,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-06-25T15:49:46.109Z\",\"extraHops\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"added\":\"2024-06-23T23:18:07.166Z\",\"depth\":3}"]}}
{"timestamp":"2024-06-25T15:49:46.309Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199","workerid":0}}
...
{"timestamp":"2024-06-25T15:50:34.199Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802"],"workerid":0}}
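As an aside, the pendingPages entries in the crawl-statistics line are themselves JSON-encoded strings, so they need a second decoding pass. A minimal sketch (Python stdlib, fields abridged from the log line above):

```python
import json

# One "Crawl statistics" log line, abridged to the relevant fields.
line = r'{"details":{"pending":1,"pendingPages":["{\"seedId\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"depth\":3}"]}}'

details = json.loads(line)["details"]
# Each pendingPages entry is a JSON string, hence the second json.loads.
pages = [json.loads(p) for p in details["pendingPages"]]
print(pages[0]["url"], pages[0]["depth"])
```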
/login.php and /posting.php are disallowed by https://forums.gentoo.org/robots.txt
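That disallow behaviour can be checked locally. A minimal sketch using Python's urllib.robotparser, with the rules inlined as an assumed excerpt of the live robots.txt (not fetched here):

```python
from urllib.robotparser import RobotFileParser

# Assumed excerpt of https://forums.gentoo.org/robots.txt (not fetched here).
robots_txt = """\
User-agent: *
Disallow: /login.php
Disallow: /posting.php
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for path in ("/posting.php?mode=reply&t=193199",
             "/login.php?redirect=posting.php",
             "/viewtopic-t-193199.html"):
    url = "https://forums.gentoo.org" + path
    # Disallow rules are prefix matches on the URL path.
    print(path, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")
```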
@benoit74 could you please check the above?
I'm not sure what you want me to check.
What I can state is that:
- robots.txt is requesting to ignore these pages, but you have no exclude parameter set
- robots.txt Disallow rules are automatically respected (at least by default)
Does that answer your request?
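For context, an exclude parameter is typically a regex matched against candidate URLs (as with browsertrix-crawler's --exclude option). A quick sketch checking that hypothetical patterns would catch the offending pages:

```python
import re

# Candidate exclude regexes (hypothetical; matched against the full URL,
# in the style of browsertrix-crawler's --exclude option).
excludes = [re.compile(p) for p in (r"/posting\.php", r"/login\.php")]

urls = [
    "https://forums.gentoo.org/posting.php?mode=reply&t=193199",
    "https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802",
    "https://forums.gentoo.org/viewtopic-t-193199.html",
]

for u in urls:
    skipped = any(p.search(u) for p in excludes)
    print(u, "-> skip" if skipped else "-> crawl")
```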
FYI, I've opened https://github.com/webrecorder/browsertrix-crawler/issues/631 to discuss the second point with webrecorder folks.
Note: I've cancelled the task and disabled the recipe for now. The configuration is wrong and more than 1M pages have been found, so there is no point in continuing the crawl as-is. Let's discuss this afternoon.