openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: NDLA - Norwegian Digital Learning Arena #626

Open tronba opened 1 year ago

tronba commented 1 year ago

Please use the following format for a ZIM creation request (and delete unnecessary information)

Popolechien commented 1 year ago

@tronba Thanks. There's no off-the-shelf way for us to filter out copyrighted images. Would the lessons still work without these illustrations (much like Wikipedia without images is still 99% Wikipedia)?

tronba commented 1 year ago

@Popolechien Good day. The copyrighted images are almost exclusively stock photos at the top or middle of articles. They are there to make things pretty, but are not crucial for the content. Where illustrations are needed for the texts, these are mainly custom-made and CC-licensed, and a lot of the "pretty" images are also CC. The site with no pictures at all would work (it would have value), but it would not be ideal, since these are teaching resources that often refer to things shown in images. The site with only the copyrighted pictures removed would work well.

tronba commented 1 year ago

@Popolechien I did a spot-check of 20 random articles from the site. 17 used CC images only. The remaining 3 had one copyrighted picture per article (none of them essential for the article content). The site has about 35 000 articles (if I'm not mistaken).

Popolechien commented 1 year ago

@RavanJAltaie can you please make a first pass, and add https://ndla.no/subject:20 in the zimit exclude parameter?
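As a rough illustration of what such an exclude rule does, here is a minimal sketch. It assumes the exclude value is treated as a regular expression that is searched for in each candidate URL; that is an assumption about the crawler, not something confirmed in this thread. The helper `is_excluded` and the sample URLs are hypothetical.

```python
import re

# Assumption: the exclude value is a regular expression tested against
# every URL the crawler discovers. The dot is escaped here for strictness.
EXCLUDE_PATTERN = r"https://ndla\.no/subject:20"


def is_excluded(url: str, pattern: str = EXCLUDE_PATTERN) -> bool:
    """Return True when the URL matches the exclusion pattern."""
    return re.search(pattern, url) is not None


# Hypothetical URLs, for illustration only.
candidates = [
    "https://ndla.no/subject:20/topic:1",
    "https://ndla.no/subject:3/topic:7",
    "https://ndla.no/article/123",
]

for url in candidates:
    print(url, "-> skipped" if is_excluded(url) else "-> crawled")

# Under this assumption, whitespace in the pattern is significant:
# a pattern with a trailing space does not match a normal URL.
print(is_excluded("https://ndla.no/subject:20/topic:1", EXCLUDE_PATTERN + " "))
```

Because the pattern is searched rather than matched exactly, it would also skip any URL that merely starts with `https://ndla.no/subject:20`, such as a hypothetical `subject:200`.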

RavanJAltaie commented 1 year ago

https://farm.openzim.org/recipes/ndla_nno_all

RavanJAltaie commented 1 year ago

This is the updated recipe, and it's failing: https://farm.openzim.org/recipes/ndla.no_no_all

RavanJAltaie commented 9 months ago

@benoit74 the recipe is failing with error:

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "crashed".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v18.17.1
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 437, in zimit
    raise subprocess.CalledProcessError(crawl.returncode, cmd_args)
subprocess.CalledProcessError: Command '['crawl', '--failOnFailedSeed', '--waitUntil', 'load', '--title', 'Norwegian Digital Learning', '--description', 'NDLA is a Norwegian joint county enterprise.', '--depth', '-1', '--timeout', '90', '--exclude', 'https://ndla.no/subject:20 ', '--lang', 'nno', '--behaviors', 'autoplay,autofetch,siteSpecific', '--behaviorTimeout', '90', '--diskUtilization', '90', '--url', 'https://ndla.no/', '--userAgent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15 +Zimit contact+zimfarm@kiwix.org', '--cwd', '/output/.tmptk15y3ym', '--statsFilename', '/output/crawl.json']' returned non-zero exit status 1.

Any idea if we can fix this?

benoit74 commented 9 months ago

Issue is upstream: https://github.com/openzim/zimit/issues/266

benoit74 commented 6 months ago

The upstream issue is supposed to be solved in the upcoming 2 release. I've moved this recipe to the batch of test ZIMs we are building to observe behavior.

benoit74 commented 6 months ago

New task ongoing: https://farm.openzim.org/pipeline/408a9436-bb42-492d-9692-ea43b20f2e10

RavanJAltaie commented 4 months ago

@benoit74 The recipe has been running for 11 days so far and 58% is scraped. Shall we wait until it's completed?

benoit74 commented 4 months ago

It is now at 62% (43359 / 68848).

It means that the crawler has so far processed 43359 pages in 10 days, 10 hours, 10 minutes, meaning roughly 20s per page on average.

Looking at the last log lines, it processed 509 pages in 3h32m57s, or 12777s, meaning 25s per page on average.

These numbers are quite high, but not very surprising given that there are a lot of videos on many pages of this website.

Supposing the total of 68848 pages is roughly correct, it means there are 25489 pages left.

Using a worst case of 30s per page, it means about 9 days left to finish, or about 20 days to run the full recipe.

I suggest that we wait for completion. It looks like a lot of time, but it is still acceptable.

I'm a bit curious to get a better understanding of the real total number of pages to fetch, and of how big the final ZIM will be.

Maybe we will realize that both ZIM size and task duration are too high and we will need to make a decision, but at least it will be an informed one, based on real figures.
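For reference, the back-of-the-envelope estimate above can be reproduced with a short script. This is only a sketch: the page counts and durations come from the figures quoted in this comment, the 30s-per-page worst case is the assumption stated above, and the function and variable names are made up for illustration.

```python
from datetime import timedelta

# Figures quoted in the comment above.
PAGES_DONE = 43359
PAGES_TOTAL = 68848
ELAPSED = timedelta(days=10, hours=10, minutes=10)

# Most recent stretch of the log: 509 pages in 3h32m57s.
RECENT_PAGES = 509
RECENT_ELAPSED = timedelta(hours=3, minutes=32, seconds=57)

# Assumed worst-case crawl rate, in seconds per page.
WORST_CASE_RATE_S = 30.0


def seconds_per_page(elapsed: timedelta, pages: int) -> float:
    """Average crawl time per page, in seconds."""
    return elapsed.total_seconds() / pages


def estimate_remaining(done: int, total: int, rate_s: float) -> timedelta:
    """Time still needed if every remaining page takes rate_s seconds."""
    return timedelta(seconds=(total - done) * rate_s)


overall_rate = seconds_per_page(ELAPSED, PAGES_DONE)
recent_rate = seconds_per_page(RECENT_ELAPSED, RECENT_PAGES)
pages_left = PAGES_TOTAL - PAGES_DONE
time_left = estimate_remaining(PAGES_DONE, PAGES_TOTAL, WORST_CASE_RATE_S)
total_time = ELAPSED + time_left

print(f"overall rate: {overall_rate:.1f} s/page")
print(f"recent rate:  {recent_rate:.1f} s/page")
print(f"pages left:   {pages_left}")
print(f"time left:    {time_left.total_seconds() / 86400:.1f} days")
print(f"total run:    {total_time.total_seconds() / 86400:.1f} days")
```

The estimate is only as good as the total page count, which can keep growing while the crawler discovers new links.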

tronba commented 3 months ago

Hi, thanks for all the work put in so far. I have tested the "ndla.no_no_all_2024-07" version, and significant parts of it work (you can search for terms and find articles and topics, and some navigation works). It is a colossal file, though (133GB).

I don't know the process from now on with this, but wanted to share what I found while testing.

There is one major fault, and I found a few minor ones.

Broken links from the front page and missing navigation layer

Normally, the navigation would be Front page – Field of study – Course – Topic – Article. I can find Courses, Topics, and Articles, and I can navigate between them correctly (using the Windows client). I can't reach the Field of study layer, and the links on the front page (which would go to the Field of study layer) point to the internet.

Minor problems

Language switch

The site is in two languages, Bokmål and Nynorsk (and all the content is in both). I can access content in both languages by searching, but I cannot switch between article languages. Usually, a switch is possible via the “Velg språk” button at the top (this works using kiwix-serve on Linux, but not in the Windows client). It might be an idea to make separate versions for each language (since both contain all the content), which would make the size more practical.

H5P content unavailable

The site uses some H5P content, but this doesn’t seem to load (at least not in Windows clients).

Video unavailable

The site contains many videos, but they won’t start playing. Because of the ZIM file's huge size, I suspect the video content is included, but the player doesn’t work. Finding a way to keep the videos would be cool, but it might be more practical to drop them.

Articles won't show when using kiwix-serve

The front page shows, and I can search and get hits. When I open an article, it shows for a moment and then redirects me to the NDLA version of a 404 page. I run kiwix-serve under Ubuntu 24.04 (I suspect it uses the 3.5 version).

benoit74 commented 1 month ago

Hi, I confirm the ZIM suffers from many flaws.

From my PoV, it is not possible currently with zimit to properly create a ZIM of this website.

Marking this as "Scraper Needed" even if I'm not sure how we can scrape "better", to be investigated.

This ZIM is not going to happen in the coming months unless we have a volunteer with dev skills looking at it.