openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Radiopaedia Recipe #1016

Open · RavanJAltaie opened this issue 11 months ago

RavanJAltaie commented 11 months ago

The recipe for Radiopaedia.org has succeeded in hidden/dev and the file size is 2.3 GB, but the internal links of the website inside the file are not working. https://farm.openzim.org/pipeline/a1f80241-f74c-4ec6-911f-b6f02f3d6d2b/debug

The recipe: https://farm.openzim.org/recipes/radiopaedia_en

Popolechien commented 11 months ago

Here is the dev ZIM: https://dev.library.kiwix.org/viewer#radiopaedia_en_2023-09/A/index.html

rgaudin commented 11 months ago

Most of the links show that the scraper was presented with the Cloudflare captcha-like security feature. This feature is usually enabled to prevent spambots and/or crawlers.

Since it's a crowd-sourced website, spam might be the reason. Licensing allows reuse.

benoit74 commented 11 months ago

@rgaudin @kelson42 Is there any strategy to apply in such situations where Cloudflare (or any other CDN, indeed) blocks us? I suspect that we are simply blocked and can't do anything, but I do not have your experience with browsertrix crawler.

rgaudin commented 11 months ago

No, it deserves a ticket on the crawler repo. It's different from https://github.com/webrecorder/browsertrix-crawler/issues/372. I don't think there's much that can be done technically, but:

benoit74 commented 11 months ago

I will open the issue on the crawler repo.

Your point regarding how the content team can identify such pages could be seen either as:

  1. a process / documentation point for content team
  2. or a technical point (we could/should detect when there are many Cloudflare "Security" pages in warc2zim and make the recipe fail)

What do you think about it? I prefer point 2 because it relies less on humans to detect these issues, but it might have a performance impact on big websites (we have to search all HTML files).
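
For illustration only, here is a rough sketch of what point 2 could look like; the function names, markers and threshold are hypothetical, not actual warc2zim code, and assume we can iterate over the HTML record bodies of the WARC:

import re

# Hypothetical, non-exhaustive markers of Cloudflare challenge/security pages.
CLOUDFLARE_MARKERS = re.compile(
    rb"(cf-browser-verification|cf_chl_|Checking your browser before accessing|"
    rb"Attention Required! \| Cloudflare)"
)

def looks_like_cloudflare_challenge(html_payload: bytes) -> bool:
    """Return True if an HTML payload resembles a Cloudflare security page."""
    return CLOUDFLARE_MARKERS.search(html_payload) is not None

def check_challenge_ratio(html_payloads, max_ratio: float = 0.1) -> None:
    """Fail the conversion if too many HTML records are challenge pages.

    html_payloads is any iterable of HTML bodies (bytes); in warc2zim it would
    come from iterating the WARC records, which is not shown here.
    """
    total = challenged = 0
    for payload in html_payloads:
        total += 1
        if looks_like_cloudflare_challenge(payload):
            challenged += 1
    if total and challenged / total > max_ratio:
        raise RuntimeError(
            f"{challenged}/{total} HTML pages look like Cloudflare challenges; "
            "the crawl was probably blocked, failing the recipe."
        )

The performance concern mentioned above is visible here: every HTML body has to be scanned, which is cheap per page but adds up on very large websites.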

Regarding the third point, @Popolechien @RavanJAltaie I let you judge whether it makes sense to contact the website manager to explain our project and ask them for alternative ways to retrieve their content automatically, or whether you prefer to just give up.

RavanJAltaie commented 11 months ago

@Popolechien since the issue was originally initiated by you, you can decide on the third point suggested by Renaud.

rgaudin commented 11 months ago

1 and 2 are not exclusive.

You need them to identify issues in created ZIM files (it's usually not identifiable from the source website) because they do QA.

We also need to detect this, but as long as we want this to be unattended, we have to offer options and decide on default values for them. It's debatable, but I don't think that warc2zim-level handling would be of any use here because the crawl is already over and the content is missing. At best you get a slightly smaller ZIM.

This would probably need to be integrated into the crawler, and maybe those behavior features that we don't use are there for that reason.

In this very example, the website returned 429 Too Many Requests which, per the HTTP spec, should not be ignored. The scraper should pause and retry at a reduced pace (or at least behave differently).
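
As an illustration of that behaviour (a minimal sketch, not the browsertrix-crawler implementation), a fetch loop that honours 429 and its optional Retry-After header could look like this:

import time
import requests  # assumed available; any HTTP client works the same way

def fetch_politely(url: str, max_retries: int = 5, base_delay: float = 10.0) -> requests.Response:
    """Fetch a URL, backing off when the server answers 429 Too Many Requests."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=90)
        if response.status_code != 429:
            return response
        # Honour Retry-After when given in seconds, otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        time.sleep(delay)
    raise RuntimeError(f"Still throttled after {max_retries} attempts: {url}")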

Popolechien commented 11 months ago

I'll drop them a line, yeah. @rgaudin @benoit74 would there be a specific ask? Or should I simply say "we're being blocked by cloudflare, do you have dumps somewhere we can use?"

benoit74 commented 11 months ago

No specific ask from my PoV

benoit74 commented 11 months ago

An update on what was discussed on Slack / at the last team meeting on Friday:

benoit74 commented 11 months ago

Adding trendmd URLs to the exclude list was not productive; it does not work like that. The exclude list is used to filter out pages to crawl, not subresources retrieved while loading a page.
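
To make the distinction concrete, here is a hypothetical sketch (not actual browsertrix-crawler code) of why excluding trendmd URLs has no effect: exclude patterns only apply to URLs queued as pages, while subresources are fetched by the browser as part of rendering a page and are recorded regardless.

import re

# Hypothetical exclude pattern, mirroring what was tried for trendmd URLs.
EXCLUDES = [re.compile(r"trendmd")]

def should_queue_page(url: str) -> bool:
    """The exclude list is applied here, to URLs queued as pages to crawl."""
    return not any(pattern.search(url) for pattern in EXCLUDES)

def record_subresources(subresource_urls: list[str]) -> list[str]:
    """Subresources requested while rendering a page are recorded as-is;
    the exclude list is never consulted for them."""
    return list(subresource_urls)

# An article page is queued (it does not match the exclude list) ...
assert should_queue_page("https://radiopaedia.org/articles/fetal-urinary-bladder")
# ... but the trendmd widget it loads is still captured as a subresource.
assert "https://www.trendmd.com/widget.js" in record_subresources(
    ["https://www.trendmd.com/widget.js"]
)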

The hypothesis about "promoted articles" is probably wrong: this section is always removed from all pages, and I don't get where I saw it in the ZIM.

So basically, in yesterday's run we got the same result again: 38 minutes after the crawl starts, pages start to finish with timeouts.

What is a bit weird is that there are two kinds of timeouts:

{
    "timestamp": "2023-09-24T13:47:41.274Z",
    "logLevel": "error",
    "context": "general",
    "message": "Page Load Timeout, skipping page",
    "details": {
        "msg": "Navigation timeout of 90000 ms exceeded",
        "page": "https://radiopaedia.org/articles/cotrel-dubousset-instrumentation?lang=us",
        "workerid": 0
    }
}

or

{
    "timestamp": "2023-09-24T13:49:11.786Z",
    "logLevel": "warn",
    "context": "general",
    "message": "Page Loading Slowly, skipping behaviors",
    "details": {
        "msg": "Navigation timeout of 90000 ms exceeded",
        "page": "https://radiopaedia.org/articles/fetal-urinary-bladder?lang=us",
        "workerid": 0
    }
}

From what I see we never get both messages, but it seems mostly random in terms of what we get.

What I don't get is that even if the page finishes in error, it looks like the page is nonetheless present in the WARC/ZIM.

I will debug this locally to get what is going on.

benoit74 commented 11 months ago

Regarding the two messages, looking at the code it is indeed random which one we get: it depends on how far the page load has progressed when the time limit is hit, so being time-based it is random by design.
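
To illustrate the logic described above (a simplified sketch, not the crawler's actual code): both messages come from the same 90 s budget, and which one is emitted depends only on whether any content had been loaded when the budget ran out, which would also explain why a "failed" page can still end up in the WARC/ZIM.

import asyncio

PAGE_LOAD_TIMEOUT = 90.0  # seconds, matching the 90000 ms seen in the logs

async def crawl_page(load_page, has_partial_content, run_behaviors) -> None:
    """Simplified sketch: one navigation timeout, two possible messages."""
    try:
        await asyncio.wait_for(load_page(), PAGE_LOAD_TIMEOUT)
    except asyncio.TimeoutError:
        if has_partial_content():
            # Some content arrived in time: keep the page, skip behaviors only.
            print("Page Loading Slowly, skipping behaviors")
        else:
            # Nothing usable arrived: drop the page altogether.
            print("Page Load Timeout, skipping page")
        return
    await run_behaviors()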

Regarding the debugging, I finally got hit on my dev machine by the original issue where we are presented with the Cloudflare security page. I modified the source code and pushed a PR to browsertrix. The modified code is now running on my machine, hopefully to get hit by the second problem, where at some point all pages finish with timeouts.

benoit74 commented 11 months ago

The modified code, which handles 429 errors and adds a failed-pages limit, is working well; I'm now waiting for https://github.com/webrecorder/browsertrix-crawler/pull/393 to be reviewed, merged, and released.
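
For context, the "failed limit" idea boils down to aborting the whole crawl once too many pages have failed, rather than silently producing an incomplete ZIM. A minimal sketch with hypothetical names (not the code of the PR above):

class TooManyFailedPages(Exception):
    """Raised when the crawl should be aborted rather than produce a bad ZIM."""

class FailedPagesLimit:
    """Abort the crawl once more than `limit` pages have failed."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.failed = 0

    def record_failure(self, url: str) -> None:
        self.failed += 1
        if self.failed > self.limit:
            raise TooManyFailedPages(
                f"{self.failed} pages failed (limit {self.limit}), last: {url}"
            )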

The 90000 ms timeouts on my machine happened only very rarely, for pages which really were taking more than 1.5 minutes to load due to the huge number of images on the page (see e.g. https://radiopaedia.org/cases/89329/studies/106250?lang=us&referrer=%2Farticles%2Fosteoid-osteoma%3Flang%3Dus%23image_list_item_54988063). So basically I do not reproduce the issue that seems to happen with the latest runs on the zimfarm, where suddenly all pages lead to timeouts.

@rgaudin the URL above which induced a timeout makes me wonder whether we might also have an issue with the fact that the referrer appears to be passed as a query parameter. I mean, is browsertrix capable of ignoring some query parameters, or will it download the resource as many times as there are different query parameter values?

@Popolechien @kelson42 What are our limits in terms of acceptable zimit crawling durations / final ZIM size? I mean, for sure the resource is valuable, but I suspect that the crawling will be long and the resulting ZIM huge, probably tens of GB from what I already saw. There are pages (like the one above) with a really huge number of images from body scans. I'm not even sure our regular workers are capable of building such a ZIM without issues.

rgaudin commented 11 months ago

I mean, is browsertrix capable to ignore some query parameter or will it download the resource as many times as we have a different query parameter value?

They are going to be different URLs in the produced WARC, and thus in the produced ZIM, but that only affects pages (HTML). The resources (images) themselves are not subject to this, so they are likely included once (as long as they use a unique URI), and the crawler likely caches them, so crawling time is probably not affected too drastically.
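
If the duplicated page URLs did become a problem, a usual mitigation is to normalise URLs before deduplication, e.g. by dropping a known tracking parameter such as referrer. A hypothetical sketch (the parameter name and the idea of doing this in the crawler are assumptions, not an existing option):

from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

# Hypothetical: query parameters to drop before deduplicating page URLs.
IGNORED_PARAMS = {"referrer"}

def normalize_page_url(url: str) -> str:
    """Drop ignored query parameters so the same page is crawled only once."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

# Both URLs below normalise to the same page URL.
assert normalize_page_url(
    "https://radiopaedia.org/cases/89329/studies/106250?lang=us&referrer=%2Farticles%2Fx"
) == normalize_page_url("https://radiopaedia.org/cases/89329/studies/106250?lang=us")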

Popolechien commented 11 months ago

@benoit74 I don't really have an upper limit in terms of ZIM size (80-100 GB to align with the bigger Stack Overflow and Wikipedia?). As for crawl time, if we do it twice a year then we can probably live with a 2-3 week crawl (enwp takes about 2-3 weeks IIRC).

Either way, since we are likely to encounter other such cases in the future, I'd favour giving this a try once, just to have a benchmark.

benoit74 commented 3 months ago

I just started the recipe again with the 2.0 dev version.

benoit74 commented 3 months ago

The crawler is still blocked; nothing much to do for now.