tronba opened this issue 1 year ago
@tronba Thanks. There's no off-the-shelf way for us to parse out copyrighted images. Would the lessons still work without these illustrations (much like Wikipedia without images is still 99% Wikipedia)?
@Popolechien Good day; the copyrighted images are almost exclusively stock photos at the top or middle of articles, there to make things pretty but not crucial for the content. Where illustrations are needed for the texts, they are mainly custom-made and CC-licensed, and a lot of the "pretty" images are also CC. The site with no pictures would still work (have value), but it would not be ideal, since these are teaching resources that often refer to things shown in images. The site with only the copyrighted pictures removed would work well.
@Popolechien I did a spot-check of 20 random articles from the site. 17 used CC images only. The last 3 had one copyrighted picture per article (none of them essential for the article content). The site has about 35 000 articles (if I'm not mistaken).
@RavanJAltaie can you please make a first pass and add https://ndla.no/subject:20 to the zimit exclude parameter?
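For context, a minimal sketch of how I understand the exclude value ends up on the underlying browsertrix-crawler command line. This is not the actual zimit code; the flags and values are copied from the crawl command visible in the traceback below, and the subprocess call is only illustrative:

```python
# Illustrative sketch only, not the zimit implementation: the recipe's "exclude"
# parameter becomes an --exclude flag on the crawl command, a regex of page URLs
# that the crawler should skip.
import subprocess

exclude_pattern = "https://ndla.no/subject:20"  # URLs matching this regex are not crawled

cmd = [
    "crawl",
    "--url", "https://ndla.no/",
    "--lang", "nno",
    "--depth", "-1",
    "--exclude", exclude_pattern,
]

# zimit raises subprocess.CalledProcessError when the crawl exits non-zero,
# which is the failure reported in the traceback below.
subprocess.run(cmd, check=True)
```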
This is the updated recipe, and it's failing: https://farm.openzim.org/recipes/ndla.no_no_all
@benoit74 the recipe is failing with error:
```
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "crashed".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
Node.js v18.17.1
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 437, in zimit
    raise subprocess.CalledProcessError(crawl.returncode, cmd_args)
subprocess.CalledProcessError: Command '['crawl', '--failOnFailedSeed', '--waitUntil', 'load', '--title', 'Norwegian Digital Learning', '--description', 'NDLA is a Norwegian joint county enterprise.', '--depth', '-1', '--timeout', '90', '--exclude', 'https://ndla.no/subject:20 ', '--lang', 'nno', '--behaviors', 'autoplay,autofetch,siteSpecific', '--behaviorTimeout', '90', '--diskUtilization', '90', '--url', 'https://ndla.no/', '--userAgent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15 +Zimit contact+zimfarm@kiwix.org', '--cwd', '/output/.tmptk15y3ym', '--statsFilename', '/output/crawl.json']' returned non-zero exit status 1.
```
Any idea if we can fix this?
Issue is upstream: https://github.com/openzim/zimit/issues/266
The upstream issue is supposed to be solved in the upcoming version 2 release; I've moved this recipe to the batch of test ZIMs we are building to observe its behavior.
New task ongoing: https://farm.openzim.org/pipeline/408a9436-bb42-492d-9692-ea43b20f2e10
@benoit74 The recipe has been running for 11 days so far and 58% has been scraped; shall we wait until it's completed?
It is now at 62% (43359 / 68848).
That means the crawler has fetched 43359 pages so far in 10 days, 10 hours, 10 minutes, i.e. about 20s per page on average.
Looking at the last log lines, it fetched 509 pages in 3h32m57s (12777s), i.e. about 25s per page on average.
These numbers are quite high, but not very surprising given that there are a lot of videos on many pages of this website.
Supposing the total of 68848 pages is roughly correct, it means there are 25489 pages left.
Using a worst case of 30s per page, that means about 9 days left to finish, or about 20 days to run the full recipe.
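For the record, a quick back-of-the-envelope check of these figures in Python (the 30s worst case per page is just the assumption stated above, and the 68848 total is the crawler's own estimate):

```python
# Sanity check of the crawl-rate estimate, using the numbers quoted in this thread.
from datetime import timedelta

total_pages = 68848            # estimated total from the crawler stats
done_pages = 43359             # pages fetched so far
elapsed = timedelta(days=10, hours=10, minutes=10)

avg_s_per_page = elapsed.total_seconds() / done_pages   # ~20.8 s/page overall
remaining = total_pages - done_pages                     # 25489 pages left
worst_case_s = remaining * 30                            # assume 30 s/page from here on

print(f"average so far: {avg_s_per_page:.1f} s/page")
print(f"days left (worst case): {worst_case_s / 86400:.1f}")                          # ~8.9
print(f"total duration: {(elapsed.total_seconds() + worst_case_s) / 86400:.1f} days")  # ~19.3
```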
I suggest that we wait for completion; it is a lot of time, but still acceptable.
I'm a bit curious to get a better understanding of the real total number of pages to fetch, and how big the final ZIM will be.
Maybe we will realize that both the ZIM size and the task duration are too high and we will need to make a decision, but at least it will be informed by real figures.
Hi, thanks for all the work put in so far. I have tested the "ndla.no_no_all_2024-07" version, and significant parts of it work (you can search for terms and find articles and topics, and some navigation works). It is a colossal file, though (133 GB).
I don't know what the process is from here, but I wanted to share what I found while testing.
There is one major fault, and I found a few minor ones.
**Broken links from front page and missing navigation layer**
Normally, the navigation would be Front page – Field of study – Course – Topic – Article. I can find Courses, Topics, and Articles, and I can navigate between them correctly (using the Windows client). I can't reach the Field of study layer, and the links on the front page (which would go to the Field of study layer) point to the internet.
**Minor problems**

**Language switch**
The site is in two languages, Bokmål and Nynorsk (all the content exists in both). I can access content in both languages by searching, but I cannot switch between article languages. Usually, switching is possible via the “Velg språk” (“Choose language”) button at the top. (This works using kiwix-serve on Linux, but not in the Windows client.) It might be an idea to make separate versions for each language (since both contain all the content), which would make the size more practical.
**H5P content unavailable**
The site uses some H5P content, but it doesn’t seem to load (at least not in the Windows client).
**Video unavailable**
The site contains many videos, but they won’t start playing. Given the huge size of the ZIM file, I suspect the video content is included but the player doesn’t work. Finding a way to keep the videos would be cool, but it might be more practical to drop them.
**Articles won’t show when using kiwix-serve**
The front page shows, and I can search and get hits. When I open an article, it shows for a moment and then sends me to the NDLA version of a 404 page. I run kiwix-serve under Ubuntu 24.04 (I suspect it uses version 3.5).
Hi, I confirm the ZIM suffers from many flaws.
From my PoV, it is not possible currently with zimit to properly create a ZIM of this website.
Marking this as "Scraper Needed", even though I'm not sure how we can scrape "better"; to be investigated.
This ZIM is not going to happen in the coming months unless we have a volunteer with dev skills looking at it.