Get 10 ZIM files scraped with Zimit 2.0

openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!

https://farm.openzim.org


Get 10 ZIM files scraped with Zimit 2.0 #766

Closed by kelson42 9 months ago

kelson42 commented 10 months ago

We first need the list of websites to scrape.

Popolechien commented 9 months ago

The first one is: https://www.bbc.com/persian

benoit74 commented 9 months ago

Contrary to what I said yesterday, it looks like https://www.bbc.com/persian is not immediately failing.

I crawled only 20 pages for now, but the ZIM is quite ok even with Zimit 1 (except all known limitations on readers, ...) and it looks like we are not immediately blocked. Let's wait for Zimit 2 and create the recipe on the Zimfarm.

rgaudin commented 9 months ago

Since it's the only reference website we have, one could crawl it and share the WARC files. This could be helpful for warc2zim QA

benoit74 commented 9 months ago

I will start a crawl on my server and share the resulting WARC files. No guarantee, but at least it is a good idea to try a full crawl on a reference website and avoid losing time crawling the same site over and over again.

Jaifroid commented 9 months ago

Not sure if you're wanting suggestions for "challenging" sites (yet), but the site codingfortomorrow, for which we have a successful ZIM codingfortomorrow_de_all_2023-11.zim, has notable cookies on the front page, which are set once on first access (and if that is working, the user should not be bothered again by them on subsequent visits), and has sliding page elements, a lot of dynamic JS, lazy-loaded images, and CSS all over the place. It would make a good test of advanced features IMHO.

A slightly less challenging site, but with some useful features to test, is Mozilla Developer Network (developer.mozilla.org), though the problem here is that we haven't had a successful scrape since February. But if we could get a reference scrape, then the dynamic code boxes (for trying out features being discussed) and the browser compatibility tables would also make for a good test of functionality.

Popolechien commented 9 months ago

We are still waiting for an additional 5 names from the Open Net folks, but overall yeah, it might not be a bad idea to test websites that are in scope and present challenges we're likely to see across other websites.

RavanJAltaie commented 9 months ago

What about https://www.harmony-project.org/music-library? As per this issue, we decided to wait for Zimit 2.0 to try on.

benoit74 commented 9 months ago

As far as I've understood, the decision has been made to focus first on the ZIMs that were requested (and paid for) for Zimit 2.0. The goal of this issue is to track our progress in retrieving this list.

Of course harmony and other pending issues will be done as well at some point.

benoit74 commented 9 months ago

I've opened a dedicated ZIM request for BBC persian, and I suggest we do the same for all other Zimit 2.0 web sites.

kelson42 commented 9 months ago

I guess there won’t be more zim files, so I propose to close and start to migrate pre-existing ZIM files.

Jaifroid commented 9 months ago

I would have thought we need a testing period, right? And I assumed these 10 would be part of the test.

kelson42 commented 9 months ago

The 9 other ZIM files necessary to assess Zimit 2.0 should be:

Please report regressions in a new ticket (and add it to the project).

kelson42 commented 9 months ago

@RavanJAltaie @benoit74 Please do what is necessary as soon as the Zimit 2.0 version is available on the Zimfarm.

benoit74 commented 9 months ago

@kelson42 should we duplicate the recipes to push the new ZIM to dev.library.kiwix.org first? Or maybe just update the prod recipe to push to dev for a short while is sufficient?

kelson42 commented 9 months ago

> @kelson42 should we duplicate the recipes to push the new ZIM to dev.library.kiwix.org first? Or maybe just update the prod recipe to push to dev for a short while is sufficient?

Second approach: same approach, back to dev

benoit74 commented 9 months ago

I updated all recipes to use zimit2 and push to dev, and requested all of them (note: some recipes were disabled, most were already pushing to dev, and the 100r.co recipe is at https://farm.openzim.org/recipes/100r.co for now).

kelson42 commented 9 months ago

@benoit74 THX, so closing the issue there. We should now assess the newly created ZIM files.