Closed kelson42 closed 9 months ago
First one is: https://www.bbc.com/persian
Contrary to what I said yesterday, it looks like https://www.bbc.com/persian is not immediately failing.
I crawled only 20 pages for now, but the ZIM is quite ok even with Zimit 1 (except all known limitations on readers, ...) and it looks like we are not immediately blocked. Let's wait for Zimit 2 and create the recipe on the Zimfarm.
Since it's the only reference website we have, one could crawl it and share the WARC files. This could be helpful for warc2zim QA.
I will start a crawl on my server and share the resulting WARC files. No guarantee, but at least it is a good idea to try a full crawl on a reference website and avoid losing time crawling the same site over and over again.
Not sure if you want suggestions for "challenging" sites (yet), but the site codingfortomorrow, for which we have a successful ZIM codingfortomorrow_de_all_2023-11.zim, has notable cookies on the front page, which are set once on first access (and if that is working, the user should not be bothered again by them on subsequent visits), and has sliding page elements, a lot of dynamic JS, lazy-loaded images, and CSS all over the place. It would make a good test of advanced features IMHO.
A slightly less challenging site, but with some useful features to test, is the Mozilla Developer Network (developer.mozilla.org), though the problem here is that we haven't had a successful scrape since February. But if we could get a reference scrape, then the dynamic code boxes (for trying out features being discussed) and the browser compatibility tables would also make for a good test of functionality.
We are still waiting for an additional 5 names from the Open Net folks, but overall yeah, it might not be a bad idea to test websites that are in scope and present challenges we're likely to see across other websites.
What about https://www.harmony-project.org/music-library? As per this issue, we decided to wait for Zimit 2.0 before trying it.
As far as I've understood, the decision has been made to focus first on the ZIMs that have been requested from us for Zimit 2.0 (and paid for). The goal of this issue is to track the progress we make in retrieving this list.
Of course harmony and other pending issues will be done as well at some point.
I've opened a dedicated ZIM request for BBC persian, and I suggest we do the same for all other Zimit 2.0 web sites.
I guess there won’t be more zim files, so I propose to close and start to migrate pre-existing ZIM files.
I would have thought we need a testing period, right? And I assumed these 10 would be part of the test.
The 9 other ZIM files necessary to assess Zimit 2.0 should be:
Please report regressions in a new ticket (and add it to the project).
@RavanJAltaie @benoit74 Please do the necessary as soon as Zimit 2.0 version is available in the Zimfarm
@kelson42 should we duplicate the recipes to push the new ZIM to dev.library.kiwix.org first? Or maybe just update the prod recipe to push to dev for a short while is sufficient?
Second approach: same approach, back to dev
I updated all recipes to use zimit2, push to dev, and requested all of them (NOTA: some recipes were disabled, most were already pushing to dev, the 100r.co recipe is at https://farm.openzim.org/recipes/100r.co for now)
@benoit74 THX, so closing the issue there. We should now assess the newly created ZIM files.
/ We need first the list of Web sites to scrape /