openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: NDLA - Norwegian Digital Learning Arena #626

Open tronba opened 1 year ago

tronba commented 1 year ago

Please use the following format for a ZIM creation request (and delete unnecessary information)

Popolechien commented 1 year ago

@tronba Thanks. There's no off-the-shelf way for us to parse out copyrighted images. Would the lessons still work without these illustrations (much like Wikipedia without images is still 99% Wikipedia)?

tronba commented 1 year ago

@Popolechien Good day. The copyrighted images are almost exclusively stock photos at the top or middle of articles. They are there to make things pretty but are not crucial to the content. Where illustrations are needed for the texts, they are mainly custom-made and CC-licensed, and many of the "pretty" images are also CC. The site with no pictures at all would work (it would have value), but it would not be ideal, since these are teaching resources that often refer to things shown in images. The site with only the copyrighted pictures removed would work well.

tronba commented 1 year ago

@Popolechien I did a spot-check of 20 random articles from the site. 17 used CC images only. The remaining 3 had one copyrighted picture per article (none of them essential to the article content). The site has about 35,000 articles (if I'm not mistaken).

Popolechien commented 1 year ago

@RavanJAltaie can you please make a first pass, and add https://ndla.no/subject:20 in the zimit exclude parameter?
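For illustration, here is a minimal sketch of how such an exclusion would behave, assuming the exclude value is treated as a regular expression searched against each candidate URL (which is how the underlying crawler's `--exclude` option is documented, as far as I know). The helper `is_excluded` and the sample URLs other than `https://ndla.no/subject:20` are hypothetical. Python's `re` stands in for the crawler's own regex engine here, which should behave the same for a pattern this simple.

```python
import re


def is_excluded(url: str, exclude_patterns: list[str]) -> bool:
    """Return True if any exclude pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in exclude_patterns)


pattern = "https://ndla.no/subject:20"

# Hypothetical URLs, used only to show the matching behaviour.
candidates = [
    "https://ndla.no/subject:20",
    "https://ndla.no/subject:20/topic:1",
    "https://ndla.no/subject:200",  # also matches: the pattern is not anchored
    "https://ndla.no/subject:3",
]

for url in candidates:
    print(url, "->", "excluded" if is_excluded(url, [pattern]) else "kept")

# A stray trailing space becomes part of the pattern, so the pattern can only
# match URLs that contain a literal space, which real page URLs normally do not.
print(is_excluded("https://ndla.no/subject:20", [pattern + " "]))  # False
```

Two things follow under that assumption: an unanchored pattern also excludes any URL that merely starts the same way (such as a hypothetical `subject:200`), and an exclude value with a trailing space, as in the command logged later in this thread, would probably exclude nothing.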

RavanJAltaie commented 1 year ago

https://farm.openzim.org/recipes/ndla_nno_all

RavanJAltaie commented 1 year ago

This is the updated recipe, and it's failing: https://farm.openzim.org/recipes/ndla.no_no_all

RavanJAltaie commented 4 months ago

@benoit74 the recipe is failing with error:

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "crashed".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v18.17.1
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 437, in zimit
    raise subprocess.CalledProcessError(crawl.returncode, cmd_args)
subprocess.CalledProcessError: Command '['crawl', '--failOnFailedSeed', '--waitUntil', 'load', '--title', 'Norwegian Digital Learning', '--description', 'NDLA is a Norwegian joint county enterprise.', '--depth', '-1', '--timeout', '90', '--exclude', 'https://ndla.no/subject:20 ', '--lang', 'nno', '--behaviors', 'autoplay,autofetch,siteSpecific', '--behaviorTimeout', '90', '--diskUtilization', '90', '--url', 'https://ndla.no/', '--userAgent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15 +Zimit contact+zimfarm@kiwix.org', '--cwd', '/output/.tmptk15y3ym', '--statsFilename', '/output/crawl.json']' returned non-zero exit status 1.

Any idea if we can fix this?

benoit74 commented 4 months ago

Issue is upstream: https://github.com/openzim/zimit/issues/266

benoit74 commented 1 month ago

The upstream issue is supposed to be solved in the upcoming zimit 2 release. I've moved this recipe to the batch of test ZIMs we are building to observe behavior.

benoit74 commented 1 month ago

New task ongoing: https://farm.openzim.org/pipeline/408a9436-bb42-492d-9692-ea43b20f2e10

RavanJAltaie commented 6 days ago

@benoit74 The recipe has been running for 11 days so far and 58% is scraped. Shall we wait until it's completed?

benoit74 commented 4 days ago

It is now at 62% (43359 / 68848).

It means that the crawler has processed 43359 pages so far in 10 days, 10 hours, 10 minutes (900600s), meaning about 21s per page on average.

Looking at the last log lines, it processed 509 pages in 3h32m57s, or 12777s, meaning about 25s per page on average.

These numbers are quite high, but not very surprising given that there are a lot of videos on many pages of this website.

Supposing the total of 68848 pages is roughly correct, it means there are 25489 pages left.

Using a worst case of 30s per page, it means about 9 days left to finish, or about 20 days to run the full recipe.
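The estimate above can be reproduced with a short sketch. All figures are taken from this comment; the variable names are only illustrative, and the 30s-per-page worst case is the assumption stated above, not a measured value.

```python
from datetime import timedelta

# Figures quoted in this comment.
pages_done = 43359
pages_total = 68848
elapsed = timedelta(days=10, hours=10, minutes=10)
recent_pages = 509
recent_window = timedelta(hours=3, minutes=32, seconds=57)

# Average time per page, over the whole run and over the last log lines.
avg_overall = elapsed.total_seconds() / pages_done
avg_recent = recent_window.total_seconds() / recent_pages

# Remaining work, priced at an assumed worst case of 30 seconds per page.
pages_left = pages_total - pages_done
worst_case_seconds_per_page = 30
remaining = timedelta(seconds=pages_left * worst_case_seconds_per_page)
total = elapsed + remaining

print(f"overall average: {avg_overall:.1f}s per page")  # 20.8s
print(f"recent average:  {avg_recent:.1f}s per page")   # 25.1s
print(f"pages left:      {pages_left}")                 # 25489
print(f"time remaining:  {remaining}")
print(f"total run time:  {total}")
```

With these inputs the subtraction gives 25489 pages left, the remaining time comes out a little under 9 days, and the full run a little over 19 days. The result is only as good as the 68848 total, which may still change as the crawl discovers more pages.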

I suggest that we wait for completion. It looks like a lot of time, but it is still acceptable.

I'm a bit curious to get a better understanding of the real total number of pages to fetch, and of how big the final ZIM will be.

Maybe we will realize that both the ZIM size and the task duration are too high and that we need to make a decision, but at least it will be an informed one based on real figures.