openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Radiopaedia Recipe #1016

Open · RavanJAltaie opened this issue 11 months ago

RavanJAltaie commented 11 months ago

The recipe for Radiopaedia.org has succeeded in hidden/dev and the file size is 2.3 GB, but the internal links of the website inside the file are not working. https://farm.openzim.org/pipeline/a1f80241-f74c-4ec6-911f-b6f02f3d6d2b/debug

The recipe: https://farm.openzim.org/recipes/radiopaedia_en

Popolechien commented 11 months ago

Here is the dev ZIM: https://dev.library.kiwix.org/viewer#radiopaedia_en_2023-09/A/index.html

rgaudin commented 11 months ago

Most of the links show that the scraper was presented with the Cloudflare captcha-like security feature. This feature is usually enabled to prevent spambots and/or crawlers.

Since it's a crowd-sourced website, spam might be the reason. Licensing allows reuse.

benoit74 commented 11 months ago

@rgaudin @kelson42 Is there any strategy to apply in such situations where Cloudflare (or any other CDN, indeed) blocks us? I suspect that we are simply blocked and can't do anything, but I do not have your experience with browsertrix crawler.

rgaudin commented 11 months ago

No, it deserves a ticket on the crawler repo. It's different from https://github.com/webrecorder/browsertrix-crawler/issues/372. I don't think there's much that can be done technically, but:

benoit74 commented 11 months ago

I will open the issue on the crawler repo.

Your point regarding how the content team can identify such pages could be seen either as:

  1. a process / documentation point for content team
  2. or a technical point (we could/should detect when there are many Cloudflare "Security" pages in warc2zim and make the recipe fail)

What do you think about it? I prefer point 2 because it relies less on humans to detect these issues, but it might have a performance impact on big websites (we have to search all HTML files).
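
For illustration only, here is a rough sketch of what point 2 could look like; the function names, markers and threshold are hypothetical, not actual warc2zim code, and assume we can iterate over the HTML record bodies of the WARC:

import re

# Hypothetical, non-exhaustive markers of Cloudflare challenge/security pages.
CLOUDFLARE_MARKERS = re.compile(
    rb"(cf-browser-verification|cf_chl_|Checking your browser before accessing|"
    rb"Attention Required! \| Cloudflare)"
)

def looks_like_cloudflare_challenge(html_payload: bytes) -> bool:
    """Return True if an HTML payload resembles a Cloudflare security page."""
    return CLOUDFLARE_MARKERS.search(html_payload) is not None

def check_challenge_ratio(html_payloads, max_ratio: float = 0.1) -> None:
    """Fail the conversion if too many HTML records are challenge pages.

    html_payloads is any iterable of HTML bodies (bytes); in warc2zim it would
    come from iterating the WARC records, which is not shown here.
    """
    total = challenged = 0
    for payload in html_payloads:
        total += 1
        if looks_like_cloudflare_challenge(payload):
            challenged += 1
    if total and challenged / total > max_ratio:
        raise RuntimeError(
            f"{challenged}/{total} HTML pages look like Cloudflare challenges; "
            "the crawl was probably blocked, failing the recipe."
        )

The performance concern mentioned above is visible here: every HTML body has to be scanned, which is cheap per page but adds up on very large websites.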

Regarding the third point, @Popolechien @RavanJAltaie I let you judge whether it makes sense to contact the website manager to explain our project and ask them for alternative ways to retrieve their content automatically, or whether you prefer to just give up.

RavanJAltaie commented 11 months ago

@Popolechien since the issue was originally initiated by you, you can decide on the third point suggested by Renaud.

rgaudin commented 11 months ago

1 and 2 are not exclusive.

You need them to identify issues in created ZIM files (it's usually not identifiable from the source website) because they do QA.

We also need to detect this, but as long as we want this to be unattended, we have to offer options and decide on default values for them. It's debatable, but I don't think that warc2zim-level handling would be of any use here because the crawl is already over and the content is missing. At best you get a slightly smaller ZIM.

This would probably need to be integrated into the crawler, and maybe those behavior features that we don't use are there for that reason.

In this very example, the website returned 429 Too Many Requests which, per the HTTP spec, should not be ignored. The scraper should pause and retry at a reduced pace (or at least behave differently).
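
As an illustration of that behaviour (a minimal sketch, not the browsertrix-crawler implementation), a fetch loop that honours 429 and its optional Retry-After header could look like this:

import time
import requests  # assumed available; any HTTP client works the same way

def fetch_politely(url: str, max_retries: int = 5, base_delay: float = 10.0) -> requests.Response:
    """Fetch a URL, backing off when the server answers 429 Too Many Requests."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=90)
        if response.status_code != 429:
            return response
        # Honour Retry-After when given in seconds, otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        time.sleep(delay)
    raise RuntimeError(f"Still throttled after {max_retries} attempts: {url}")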

Popolechien commented 11 months ago

I'll drop them a line, yeah. @rgaudin @benoit74 would there be a specific ask? Or should I simply say "we're being blocked by cloudflare, do you have dumps somewhere we can use?"

benoit74 commented 11 months ago

No specific ask from my PoV

benoit74 commented 11 months ago

An update on what was discussed on Slack / at the last team meeting on Friday:

benoit74 commented 11 months ago

Adding trendmd URLs to the exclude list was not productive; it does not work like that. The exclude list is used to filter out pages to crawl, not subresources retrieved while loading a page.
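
To make the distinction concrete, here is a hypothetical sketch (not actual browsertrix-crawler code) of why excluding trendmd URLs has no effect: exclude patterns only apply to URLs queued as pages, while subresources are fetched by the browser as part of rendering a page and are recorded regardless.

import re

# Hypothetical exclude pattern, mirroring what was tried for trendmd URLs.
EXCLUDES = [re.compile(r"trendmd")]

def should_queue_page(url: str) -> bool:
    """The exclude list is applied here, to URLs queued as pages to crawl."""
    return not any(pattern.search(url) for pattern in EXCLUDES)

def record_subresources(subresource_urls: list[str]) -> list[str]:
    """Subresources requested while rendering a page are recorded as-is;
    the exclude list is never consulted for them."""
    return list(subresource_urls)

# An article page is queued (it does not match the exclude list) ...
assert should_queue_page("https://radiopaedia.org/articles/fetal-urinary-bladder")
# ... but the trendmd widget it loads is still captured as a subresource.
assert "https://www.trendmd.com/widget.js" in record_subresources(
    ["https://www.trendmd.com/widget.js"]
)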

The hypothesis about "promoted articles" is probably wrong: this section is always removed from all pages, and I don't get where I saw it in the ZIM.

So basically, in yesterday's run we got the same result again: 38 minutes after the crawl starts, pages start to finish with timeouts.

What is a bit weird is that there are two kinds of timeouts:

{
    "timestamp": "2023-09-24T13:47:41.274Z",
    "logLevel": "error",
    "context": "general",
    "message": "Page Load Timeout, skipping page",
    "details": {
        "msg": "Navigation timeout of 90000 ms exceeded",
        "page": "https://radiopaedia.org/articles/cotrel-dubousset-instrumentation?lang=us",
        "workerid": 0
    }
}

or

{
    "timestamp": "2023-09-24T13:49:11.786Z",
    "logLevel": "warn",
    "context": "general",
    "message": "Page Loading Slowly, skipping behaviors",
    "details": {
        "msg": "Navigation timeout of 90000 ms exceeded",
        "page": "https://radiopaedia.org/articles/fetal-urinary-bladder?lang=us",
        "workerid": 0
    }
}

From what I see we never get both messages, but it seems mostly random in terms of what we get.

What I don't get is that even if the page finishes in error, it looks like the page is nonetheless present in the WARC/ZIM.

I will debug this locally to get what is going on.

benoit74 commented 11 months ago

Regarding the two messages, looking at the code it is indeed random which one we get: it depends on how far the page load has progressed when the time limit is hit, so being time-based it is random by design.
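
To illustrate the logic described above (a simplified sketch, not the crawler's actual code): both messages come from the same 90 s budget, and which one is emitted depends only on whether any content had been loaded when the budget ran out, which would also explain why a "failed" page can still end up in the WARC/ZIM.

import asyncio

PAGE_LOAD_TIMEOUT = 90.0  # seconds, matching the 90000 ms seen in the logs

async def crawl_page(load_page, has_partial_content, run_behaviors) -> None:
    """Simplified sketch: one navigation timeout, two possible messages."""
    try:
        await asyncio.wait_for(load_page(), PAGE_LOAD_TIMEOUT)
    except asyncio.TimeoutError:
        if has_partial_content():
            # Some content arrived in time: keep the page, skip behaviors only.
            print("Page Loading Slowly, skipping behaviors")
        else:
            # Nothing usable arrived: drop the page altogether.
            print("Page Load Timeout, skipping page")
        return
    await run_behaviors()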

Regarding the debugging, I finally got hit on my dev machine by the original issue where we are presented with the Cloudflare security page. I modified the source code and pushed a PR to browsertrix. The modified code is now running on my machine, hopefully to get hit by the second problem, where at some point all pages finish with timeouts.

benoit74 commented 11 months ago

The modified code, which handles 429 errors and adds a failed-pages limit, is working well; I'm now waiting for https://github.com/webrecorder/browsertrix-crawler/pull/393 to be reviewed, merged, and released.
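
For context, the "failed limit" idea boils down to aborting the whole crawl once too many pages have failed, rather than silently producing an incomplete ZIM. A minimal sketch with hypothetical names (not the code of the PR above):

class TooManyFailedPages(Exception):
    """Raised when the crawl should be aborted rather than produce a bad ZIM."""

class FailedPagesLimit:
    """Abort the crawl once more than `limit` pages have failed."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.failed = 0

    def record_failure(self, url: str) -> None:
        self.failed += 1
        if self.failed > self.limit:
            raise TooManyFailedPages(
                f"{self.failed} pages failed (limit {self.limit}), last: {url}"
            )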

The 90000 ms timeouts on my machine happened only very rarely, for pages which really were taking more than 1.5 minutes to load due to the huge number of images on the page (see e.g. https://radiopaedia.org/cases/89329/studies/106250?lang=us&referrer=%2Farticles%2Fosteoid-osteoma%3Flang%3Dus%23image_list_item_54988063). So basically I do not reproduce the issue that seems to happen with the latest runs on the zimfarm, where suddenly all pages lead to timeouts.

@rgaudin the URL above which induced a timeout makes me wonder whether we might also have an issue with the fact that the referrer appears to be passed as a query parameter. I mean, is browsertrix capable of ignoring some query parameters, or will it download the resource as many times as there are different query parameter values?

@Popolechien @kelson42 What are our limits in terms of acceptable zimit crawling durations / final ZIM size? I mean, for sure the resource is valuable, but I suspect that the crawling will be long and the resulting ZIM huge, probably tens of GB from what I already saw. There are pages (like the one above) with a really huge number of images from body scans. I'm not even sure our regular workers are capable of building such a ZIM without issues.

rgaudin commented 11 months ago

I mean, is browsertrix capable to ignore some query parameter or will it download the resource as many times as we have a different query parameter value?

They are going to be different URLs in the produced WARC, and thus in the produced ZIM, but that only affects pages (HTML). The resources (images) themselves are not subject to this, so they are likely included once (as long as they use a unique URI), and the crawler likely caches them, so crawling time is probably not affected too drastically.
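
If the duplicated page URLs did become a problem, a usual mitigation is to normalise URLs before deduplication, e.g. by dropping a known tracking parameter such as referrer. A hypothetical sketch (the parameter name and the idea of doing this in the crawler are assumptions, not an existing option):

from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

# Hypothetical: query parameters to drop before deduplicating page URLs.
IGNORED_PARAMS = {"referrer"}

def normalize_page_url(url: str) -> str:
    """Drop ignored query parameters so the same page is crawled only once."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

# Both URLs below normalise to the same page URL.
assert normalize_page_url(
    "https://radiopaedia.org/cases/89329/studies/106250?lang=us&referrer=%2Farticles%2Fx"
) == normalize_page_url("https://radiopaedia.org/cases/89329/studies/106250?lang=us")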

Popolechien commented 11 months ago

@benoit74 I don't really have an upper limit in terms of ZIM size (80-100 GB to align with the bigger Stack Overflow and Wikipedia?). As for crawl time, if we do it twice a year then we can probably live with a 2-3 week crawl (enwp takes about 2-3 weeks IIRC).

Either way, since we are likely to encounter other such cases in the future, I'd favour giving this a try once, just to have a benchmark.

benoit74 commented 3 months ago

I just started the recipe again with the 2.0 dev version.

benoit74 commented 3 months ago

The crawler is still blocked; nothing much to do for now.