openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Simplify warehouse list #958

Open Popolechien opened 2 months ago

Popolechien commented 2 months ago

The current list of warehouse paths sort of follows the various scrapers we are using, but not really, and this gets confusing, particularly as it seems to give prominence to some content for no obvious reason (e.g. vikidia, psiram). I also understand that, besides providing a modicum of ordering for the available files, this system may be used by various mirrors to pick and choose which content they want to mirror (in practice, however, only the Wikimedia Foundation restricts its mirroring to Wikimedia-related content).

I suggest simplifying the list of warehouses to be more congruent with our scrapers, i.e.:

freecodecamp
gutenberg
ifixit
nautilus
openedx
phet
stackexchange
youtube (incl. ted)
wikihow
wikimedia
other wikis

(not discussing the /.hidden folders that have their own, clearly-defined purpose)

The naming is not 100% ideal, as we need to force a distinction between WMF and non-WMF wikis, but other than that it seems a move in the right direction.

rgaudin commented 2 months ago

OK, just so we're clear: it's going to be a difficult task because mirrors use rsync and there is no such thing as renaming there. So if it's not properly coordinated (and we're talking about 12 different people), it could result in huge transfers: deleting everything and re-downloading everything, for instance.
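To make the concern concrete, here is a minimal sketch of what a path-based sync sees when a folder is renamed. It is a simplified model, not rsync itself: it compares listings by path only, ignores files whose content changed in place, and ignores any rsync options that might soften the effect. The function name `sync_plan`, the file names, and the sizes are all hypothetical.

```python
def sync_plan(mirror: dict[str, int], upstream: dict[str, int]) -> tuple[set[str], int]:
    """Compare two {path: size} listings the way a path-based sync would.

    Returns the set of paths that would be deleted on the mirror and the
    number of bytes that would have to be downloaded again.
    """
    # A path that no longer exists upstream is deleted on the mirror.
    to_delete = set(mirror) - set(upstream)
    # A path the mirror has never seen is transferred in full,
    # even if an identical file exists under the old path.
    to_download = sum(size for path, size in upstream.items() if path not in mirror)
    return to_delete, to_download


# Hypothetical listings: same files, moved to a new folder upstream.
mirror = {
    "vikidia/vikidia_fr_all.zim": 900,
    "psiram/psiram_de_all.zim": 300,
}
upstream = {
    "other_wikis/vikidia_fr_all.zim": 900,
    "other_wikis/psiram_de_all.zim": 300,
}

deleted, downloaded = sync_plan(mirror, upstream)
print(sorted(deleted))
print(downloaded)
```

In this model nothing is recognised as a move: every renamed file counts once as a deletion and once as a full download, which is why the rename would need to be coordinated with each mirror.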

benoit74 commented 2 months ago

Could someone write up some documentation or an explanation of the intent behind these warehouse paths, so that we are all on the same page before making any decision?

rgaudin commented 2 months ago

A number of users including ourselves have always been using it to find and download ZIM files.

It used to be this or the wiki. Now all readers (except kiwix-serve) include a downloader, and we have library.kiwix.org, which offers downloads as well.

I personally use it exclusively but have never been attached to the folders.

kelson42 commented 2 months ago

Which warehouse folders exist is arbitrary and, to a large extent, should not be that important (for end users). The problem here is that it is "confusing" for Zimfarm editors, and this is IMHO primarily a UI problem.

We could choose almost automatically where to store the ZIM files, based on the scraper and on the chosen "collection".
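A rough sketch of what such an automatic choice could look like, assuming the folder names proposed at the top of this issue. Everything here is hypothetical and not existing Zimfarm code: the scraper identifiers on the left of the mapping, the `other_wikis` folder spelling, the `wikimedia` collection label, and the idea that the collection is what separates WMF from non-WMF wikis.

```python
# Hypothetical scraper -> warehouse folder mapping. Folder names follow the
# list proposed in this issue; the scraper identifiers are assumptions.
SCRAPER_TO_WAREHOUSE = {
    "freecodecamp": "freecodecamp",
    "gutenberg": "gutenberg",
    "ifixit": "ifixit",
    "nautilus": "nautilus",
    "openedx": "openedx",
    "phet": "phet",
    "sotoki": "stackexchange",
    "youtube": "youtube",
    "ted": "youtube",  # the proposal groups TED with youtube
    "wikihow": "wikihow",
}

WIKI_SCRAPER = "mwoffliner"          # assumed identifier of the wiki scraper
WIKIMEDIA_COLLECTION = "wikimedia"   # assumed collection label


def warehouse_path(scraper: str, collection: str = "") -> str:
    """Pick a warehouse folder from the scraper and, for wikis, the collection."""
    if scraper == WIKI_SCRAPER:
        # Only wikis need the collection to tell WMF and non-WMF content apart.
        if collection == WIKIMEDIA_COLLECTION:
            return "/wikimedia"
        return "/other_wikis"
    if scraper not in SCRAPER_TO_WAREHOUSE:
        raise ValueError(f"no warehouse folder configured for scraper {scraper!r}")
    return "/" + SCRAPER_TO_WAREHOUSE[scraper]


print(warehouse_path("ted"))
print(warehouse_path("mwoffliner", "wikimedia"))
print(warehouse_path("mwoffliner"))
```

With something along these lines, the editor UI would not need to expose the folder at all for most recipes; an unknown scraper fails loudly instead of silently landing in an arbitrary folder.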

The "collection" basically means: in which library the produced ZIM should appear. For now we formally have only one collection, but once this is properly handled in the CMS we will have many of them.

I am still a bit unsure about exactly what the separation of duties between the Zimfarm and the CMS should look like.