RavanJAltaie opened this issue 1 year ago
Here is the dev ZIM: https://dev.library.kiwix.org/viewer#radiopaedia_en_2023-09/A/index.html
Most of the links show that the scraper was presented with the Cloudflare captcha-like security feature. This feature is usually enabled to prevent spambots and/or crawlers.
Since it's a crowd-sourced website, spam might be the reason. Licensing allows reuse.
@rgaudin @kelson42 Is there any strategy to apply in situations where Cloudflare (or any other CDN) blocks us? I suspect that we are simply blocked and can't do anything, but I do not have your experience with browsertrix crawler.
No, it deserves a ticket on the crawler repo. It's different from https://github.com/webrecorder/browsertrix-crawler/issues/372. I don't think there's much that can be done technically, but:
I will open the issue on the crawler repo.
Your point regarding how content people can identify such pages could be addressed either as:
What do you think about it? I prefer point 2 because it relies less on humans to detect these issues, but it might have a performance impact on big websites (we have to search all HTML files).
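To make point 2 concrete, here is a minimal sketch of what "search all HTML files" could look like, assuming the ZIM/WARC content has been dumped to a directory: scan every HTML file for typical Cloudflare challenge markers and flag the build if any are found. The marker strings, the `./zim-dump` path and the helper functions are illustrative assumptions, not an existing zimit or warc2zim feature.

```ts
import { promises as fs } from "fs";
import path from "path";

// Strings typically found on Cloudflare challenge pages (illustrative, not exhaustive).
const CHALLENGE_MARKERS = [
  "Checking your browser before accessing",
  "cf-browser-verification",
  "Attention Required! | Cloudflare",
];

// Recursively yield all .html files under a directory (e.g. an extracted ZIM/WARC dump).
async function* walkHtmlFiles(dir: string): AsyncGenerator<string> {
  for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) yield* walkHtmlFiles(full);
    else if (entry.name.endsWith(".html")) yield full;
  }
}

// Return the pages that look like Cloudflare challenge pages instead of real content.
async function findChallengePages(root: string): Promise<string[]> {
  const hits: string[] = [];
  for await (const file of walkHtmlFiles(root)) {
    const html = await fs.readFile(file, "utf-8");
    if (CHALLENGE_MARKERS.some((marker) => html.includes(marker))) hits.push(file);
  }
  return hits;
}

findChallengePages("./zim-dump").then((pages) => {
  if (pages.length > 0) {
    console.error(`Found ${pages.length} challenge-looking page(s), the crawl was probably blocked`);
    process.exitCode = 1;
  }
});
```

The performance concern mentioned above is exactly this loop: on a big website it means reading every HTML entry once more after the crawl.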
Regarding the third point, @Popolechien @RavanJAltaie I let you judge whether it makes sense to contact the website manager to explain our project and ask them for alternate ways to retrieve their content automatically, or if you prefer to just give up.
@Popolechien since the issue was originally initiated by you, you can decide on the third point suggested by Renaud.
Points 1 and 2 are not exclusive.
You need them to identify issues in created ZIM files (it's usually not identifiable from the source website) because they do QA.
We also need to detect this, but as long as we want this to be unattended, we have to offer options and decide on default values for them. It's debatable, but I don't think that warc2zim-level handling would be of any use here, because the crawl is already over and the content is missing; at best you get a slightly smaller ZIM.
This would probably need to be integrated into the crawler, and maybe those behavior features that we don't use are there for that reason.
In this very example, the website returned `429 Too Many Requests`, which, per the HTTP spec, should not be ignored: the scraper should pause and retry at a reduced pace (or at least behave differently).
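For illustration only, a rough sketch (assuming a plain `fetch`, not the actual browsertrix-crawler change) of what "pause and retry at a reduced pace" can mean: honour the Retry-After header when the server sends one, otherwise back off exponentially; the retry count and base delay are arbitrary values.

```ts
// Fetch a URL, backing off whenever the server answers 429 Too Many Requests.
async function fetchWithBackoff(
  url: string,
  maxRetries = 5,
  baseDelayMs = 10_000,
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const resp = await fetch(url);
    if (resp.status !== 429) return resp;

    // Retry-After is expressed in seconds; fall back to exponential backoff
    // when the header is absent or not a number.
    const retryAfterSec = Number(resp.headers.get("retry-after"));
    const waitMs =
      Number.isFinite(retryAfterSec) && retryAfterSec > 0
        ? retryAfterSec * 1000
        : baseDelayMs * 2 ** attempt;
    console.warn(`429 on ${url}, waiting ${waitMs} ms before retry ${attempt + 1}`);
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  throw new Error(`Still rate-limited after ${maxRetries} retries: ${url}`);
}
```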
I'll drop them a line, yeah. @rgaudin @benoit74 would there be a specific ask? Or should I simply say "we're being blocked by cloudflare, do you have dumps somewhere we can use?"
No specific ask from my PoV
An update from what has been discussed on Slack / last team meeting on Friday:
- The `workers` parameter was set to 6. This is way too aggressive: it is as if 6 tabs of your browser were crawling the website at the same time, without any pause (see the sketch after this list). The website could be overwhelmed by this level of traffic, or a CDN like Cloudflare could ban our crawler. This parameter must be kept at its default value of 1 (which is the youzim.it value as well) unless we have good reasons not to do so (@RavanJAltaie: please review all zimit recipes and reset this parameter to its default value).
- Adding trendmd URLs to the exclude list was not productive; it does not work like that. The exclude list is used to filter out pages to crawl, not subresources retrieved during a page retrieval.
- The hypothesis about "promoted articles" is probably wrong: this section is always removed from all pages, and I don't get where I saw it in the ZIM.
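To make the `workers` point concrete, here is a rough model of what the setting means (the `crawlPage` placeholder is hypothetical, this is not browsertrix code): N pages being crawled in parallel, each worker grabbing the next URL as soon as it finishes the previous one, with no pause in between.

```ts
// Placeholder for "open the page in a browser tab, run behaviors, write to the WARC".
async function crawlPage(url: string): Promise<void> {
  console.log(`crawling ${url}`);
}

// `workers` pages are in flight at the same time; with workers=6 the site sees
// six concurrent, pause-less request streams, with workers=1 a single one.
async function crawlQueue(urls: string[], workers: number): Promise<void> {
  const queue = [...urls];
  const worker = async () => {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      await crawlPage(url);
    }
  };
  await Promise.all(Array.from({ length: workers }, worker));
}
```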
So basically, in yesterday's run we got the same result again: 38 minutes after starting to crawl, pages start to finish with timeouts.
What is a bit weird is that there are two kinds of timeouts:
```json
{
  "timestamp": "2023-09-24T13:47:41.274Z",
  "logLevel": "error",
  "context": "general",
  "message": "Page Load Timeout, skipping page",
  "details": {
    "msg": "Navigation timeout of 90000 ms exceeded",
    "page": "https://radiopaedia.org/articles/cotrel-dubousset-instrumentation?lang=us",
    "workerid": 0
  }
}
```
or
```json
{
  "timestamp": "2023-09-24T13:49:11.786Z",
  "logLevel": "warn",
  "context": "general",
  "message": "Page Loading Slowly, skipping behaviors",
  "details": {
    "msg": "Navigation timeout of 90000 ms exceeded",
    "page": "https://radiopaedia.org/articles/fetal-urinary-bladder?lang=us",
    "workerid": 0
  }
}
```
From what I see we never get both messages, but it seems mostly random in terms of what we get.
What I don't get is that even when a page finishes in error, it looks like the page is nevertheless present in the WARC/ZIM.
I will debug this locally to get what is going on.
Regarding the two messages: looking at the code, which one we get is indeed somewhat random, because it depends on how far page loading has progressed when the time budget expires; since this is time-based by design, it can vary from page to page.
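Purely to illustrate that, a minimal sketch (not the crawler's actual code) of how a single 90 s budget can surface as either message, depending on whether the page had finished loading its DOM when the budget expired:

```ts
const PAGE_TIMEOUT_MS = 90_000;

// Stub standing in for Puppeteer-style navigation: it flips state.domLoaded once
// the DOM is ready, then keeps loading subresources (images, scripts, ...).
async function navigate(url: string, state: { domLoaded: boolean }): Promise<void> {
  state.domLoaded = true; // the real implementation lives in the browser/crawler
}

async function processPage(url: string): Promise<void> {
  const state = { domLoaded: false };
  try {
    await Promise.race([
      navigate(url, state),
      new Promise((_, reject) =>
        setTimeout(
          () => reject(new Error("Navigation timeout of 90000 ms exceeded")),
          PAGE_TIMEOUT_MS,
        ),
      ),
    ]);
  } catch (err) {
    // The same timeout surfaces as one of two messages, depending on how far
    // the page had got when the budget ran out.
    if (!state.domLoaded) {
      console.error("Page Load Timeout, skipping page", { msg: (err as Error).message, page: url });
      return; // the page is dropped entirely
    }
    console.warn("Page Loading Slowly, skipping behaviors", { msg: (err as Error).message, page: url });
    // the page is kept, but in-page behaviors are skipped
  }
}
```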
Regarding the debugging, I finally got hit on my dev machine by the original issue, where we are presented with the Cloudflare security page. I modified the source code and pushed a PR to browsertrix. The modified code is now running on my machine, hopefully to reproduce the second problem, i.e. the fact that at some point all pages finish with timeouts.
The modified code, handling 429 errors and adding a limit on failed pages, is working well; I'm now waiting for review + merge + release of https://github.com/webrecorder/browsertrix-crawler/pull/393.
The 90000 ms timeouts happened only very rarely on my machine, for pages which really take more than 1.5 minutes to load due to the huge number of images on the page (see e.g. https://radiopaedia.org/cases/89329/studies/106250?lang=us&referrer=%2Farticles%2Fosteoid-osteoma%3Flang%3Dus%23image_list_item_54988063). So basically I do not reproduce the issue that seems to happen with the latest runs on the Zimfarm, where suddenly all pages lead to timeouts.
@rgaudin the URL above, which induced a timeout, makes me wonder whether we might also have an issue with the fact that the referrer seems to be passed as a query parameter. I mean, is browsertrix capable of ignoring some query parameters, or will it download the resource as many times as there are different query parameter values?
@Popolechien @kelson42 What are our limits in terms of acceptable zimit crawling duration / final ZIM size? For sure the resource is valuable, but I suspect that the crawl will be long and the resulting ZIM huge, probably tens of GB from what I already saw. There are pages (like the one above) with a really huge number of images from body scans. I'm not even sure our regular workers are capable of building such a ZIM without issues.
> I mean, is browsertrix capable of ignoring some query parameters, or will it download the resource as many times as there are different query parameter values?
They are going to be different URLs in the produced WARC, and thus in the produced ZIM, but that only affects pages (HTML). The resources (images) themselves are not subject to this, so they are likely included only once (as long as they use a unique URI), and the crawler is likely caching them, so crawling time is probably not affected too drastically.
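If we ever wanted to neutralize such a parameter ourselves (e.g. when preparing the list of URLs to crawl), a simple normalization along these lines would do it; the choice of `referrer` as the parameter to drop is specific to this site, and whether browsertrix exposes a hook for this is something I haven't checked.

```ts
// Strip query parameters that only carry tracking/navigation context (here: "referrer")
// so that otherwise-identical pages are not treated as distinct URLs.
function normalizeUrl(raw: string, dropParams: string[] = ["referrer"]): string {
  const url = new URL(raw);
  for (const param of dropParams) url.searchParams.delete(param);
  return url.toString();
}

console.log(
  normalizeUrl(
    "https://radiopaedia.org/cases/89329/studies/106250?lang=us&referrer=%2Farticles%2Fosteoid-osteoma%3Flang%3Dus%23image_list_item_54988063",
  ),
);
// -> https://radiopaedia.org/cases/89329/studies/106250?lang=us
```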
@benoit74 I don't really have an upper limit in terms of ZIM size (80-100 GB, to align with the bigger Stack Overflow and Wikipedia?). As for crawl time, if we do it twice a year then we can probably live with a 2-3 week crawl (enwp takes about 2-3 weeks IIRC).
Either way, since we are likely to encounter other such cases in the future, I'd favour giving this a try once, just to have a benchmark.
I just started the recipe again with the 2.0 dev version.
Our crawler is still blocked; nothing much to do for now.
The Radiopaedia.org recipe has succeeded in hidden/dev; the file size is 2.3 GB. But the internal links of the website inside the file are not working. https://farm.openzim.org/pipeline/a1f80241-f74c-4ec6-911f-b6f02f3d6d2b/debug
The recipe: https://farm.openzim.org/recipes/radiopaedia_en