openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Delete incorrect flavour files #843

Closed RavanJAltaie closed 3 months ago

RavanJAltaie commented 7 months ago

Please delete these files as part of cleaning up the library. Their corresponding recipes have been corrected, then disabled because they kept failing.

  1. https://download.kiwix.org/zim/other/appropedia_en_all_maxi_2021-03.zim
  2. https://download.kiwix.org/zim/other/appropedia_en_all_nopic_2021-03.zim
  3. https://download.kiwix.org/zim/other/allthetropes_en_all_maxi_2020-10.zim
  4. https://download.kiwix.org/zim/other/artofproblemsolving_en_all_maxi_2021-03.zim
  5. https://download.kiwix.org/zim/other/artofproblemsolving_en_all_nopic_2021-03.zim
  6. https://download.kiwix.org/zim/other/dandwiki_en_all_maxi_2021-03.zim
  7. https://download.kiwix.org/zim/other/dandwiki_en_all_nopic_2021-03.zim
  8. https://download.kiwix.org/zim/other/edutechwiki_en_all_maxi_2021-03.zim
  9. https://download.kiwix.org/zim/other/edutechwiki_fr_all_maxi_2021-03.zim
  10. https://download.kiwix.org/zim/other/ekopedia_fr_all_maxi_2021-03.zim
  11. https://download.kiwix.org/zim/other/ekopedia_fr_all_nopic_2021-03.zim
  12. https://download.kiwix.org/zim/other/explainxkcd_en_all_maxi_2021-03.zim
  13. https://download.kiwix.org/zim/other/gentoo_en_all_maxi_2021-03.zim
  14. https://download.kiwix.org/zim/other/gentoo_en_all_nopic_2021-03.zim
  15. https://download.kiwix.org/zim/other/granbluefantasy_en_all_maxi_2021-03.zim
  16. https://download.kiwix.org/zim/other/granbluefantasy_en_all_nopic_2021-03.zim
  17. https://download.kiwix.org/zim/other/hamichlol_he_all_maxi_2021-02.zim
  18. https://download.kiwix.org/zim/other/hamichlol_he_all_nopic_2021-02.zim
  19. https://download.kiwix.org/zim/other/installgentoo_en_all_nopic_2019-09.zim
  20. https://download.kiwix.org/zim/other/installgentoo_en_all_nopic_2019-09.zim
  21. https://download.kiwix.org/zim/other/pokepedia_fr_all_maxi_2021-01.zim
  22. https://download.kiwix.org/zim/other/pokewiki_de_all_maxi_2021-03.zim
  23. https://download.kiwix.org/zim/other/rationalwiki_en_all_maxi_2021-03.zim
  24. https://download.kiwix.org/zim/other/rationalwiki_en_all_nopic_2021-03.zim
  25. https://download.kiwix.org/zim/other/scratch-wiki_fr_all_maxi_2021-02.zim
  26. https://download.kiwix.org/zim/other/scratch-wiki_fr_all_nopic_2021-02.zim
  27. https://download.kiwix.org/zim/other/stupidedia_de_all_maxi_2021-02.zim
  28. https://download.kiwix.org/zim/other/wikem_en_all_maxi_2021-02.zim
  29. https://download.kiwix.org/zim/other/wikem_en_all_nopic_2021-02.zim
  30. https://download.kiwix.org/zim/other/wikishia_fa_all_maxi_2021-03.zim
  31. https://download.kiwix.org/zim/other/wikishia_fa_all_nopic_2021-03.zim
  32. https://download.kiwix.org/zim/other/yeshiva_he_all_maxi_2021-02.zim
  33. https://download.kiwix.org/zim/other/yeshiva_he_all_nopic_2021-03.zim
  34. https://download.kiwix.org/zim/vikidia/vikidia_ru_all_maxi_2021-03.zim
  35. https://download.kiwix.org/zim/vikidia/vikidia_scn_all_maxi_2021-03.zim
  36. https://download.kiwix.org/zim/vikidia/vikidia_scn_all_nopic_2021-03.zim
  37. https://download.kiwix.org/zim/wikiversity/wikiversity_fr_all_maxi_2021-02.zim
  38. https://download.kiwix.org/zim/wikiversity/wikiversity_fr_all_nopic_2021-02.zim
  39. https://download.kiwix.org/zim/wikiversity/wikiversity_en_all_maxi_2021-03.zim
  40. https://download.kiwix.org/zim/wikiversity/wikiversity_en_all_nopic_2021-03.zim
  41. https://download.kiwix.org/zim/wikiversity/wikiversity_de_all_maxi_2021-03.zim
  42. https://download.kiwix.org/zim/wikiversity/wikiversity_de_all_nopic_2021-03.zim
  43. https://download.kiwix.org/zim/wikisource/wikisource_zh-min-nan_all_maxi_2020-07.zim
  44. https://download.kiwix.org/zim/wikisource/wikisource_zh-min-nan_all_nopic_2020-07.zim
  45. https://download.kiwix.org/zim/wikisource/wikisource_sa_all_maxi_2021-02.zim
  46. https://download.kiwix.org/zim/wikisource/wikisource_sa_all_nopic_2021-02.zim
  47. https://download.kiwix.org/zim/wikipedia/wikipedia_sk_all_maxi_2021-03.zim
  48. https://download.kiwix.org/zim/wikipedia/wikipedia_sk_all_mini_2021-03.zim
  49. https://download.kiwix.org/zim/wikibooks/wikibooks_en_all_maxi_2021-03.zim
  50. https://download.kiwix.org/zim/wikibooks/wikibooks_en_all_nopic_2021-03.zim
  51. https://download.kiwix.org/zim/wikibooks/wikibooks_de_all_maxi_2021-03.zim
  52. https://download.kiwix.org/zim/wikibooks/wikibooks_de_all_nopic_2021-03.zim

benoit74 commented 7 months ago

I don't get why we are deleting these files if the recipe is continuously failing and the only problem in the ZIM is an incorrect flavor.

If the recipe is continuously failing, I would rather find a way to fix the flavor in the ZIM (only devs can do that, but it is probably feasible) so that we do not lose all this content.

@Popolechien @kelson42 WDYT?

Popolechien commented 7 months ago

Well there's the problem of content becoming stale (I remember a recent deletion request for a game-related zim file that I posted not too long ago), but more broadly this raises the question of whether we're an archiving service or an offline internet one (I think we're the latter).

There's an ancillary impact to deleting some of the files (MDwiki is used in one of the pre-made configs we sell from the imager, for instance), so there better be a discussion of what the policy should be, e.g. if

  1. The content is evergreen but the scraper is not working
  2. The content is evergreen but the source website is dead
  3. The content is bound to evolve and the scraper is not working
  4. The content is bound to evolve and the website is dead

To be clear, by "evergreen" I mean content that was relevant then, and still is now. Medical or encyclopedia-type content probably qualifies; at the other end of the spectrum I see content that is related to an evolving topic (e.g. games, entertainment, user manuals for tech à la scratchwiki or gentoo).

I reckon that we should keep files meeting criterion 1 (then the hard question is when content is evergreen -ish and how much grey we are willing to handle, or how much time we're willing to spend trying to figure out if Pokemon is still evolving as a game).

benoit74 commented 7 months ago

You are right, this is probably a very good example of a missing Policy around criteria for deletion of ZIMs.

My personal taste is that we should keep as much content as possible until it is either significantly outdated or becomes an issue in terms of copyright or acceptable content. As a user, I would be embarrassed / sad to realize that offline content is disappearing at the same pace as the online websites. As a contributor, I would be embarrassed to realize we are throwing away content we've spent time / resources to create. But clearly, this is a personal taste and I might not be aligned with Kiwix / openZIM goals.

Somehow, I don't think archiving vs. offline internet is the right question. I agree that we should not keep all versions of all content we've ever created for archiving purposes. But that does not mean we shouldn't keep as much content as possible, since we've invested in it.

benoit74 commented 7 months ago

Maybe we could discuss it in our Friday meeting?

RavanJAltaie commented 7 months ago

@benoit74 We've discussed this with @Popolechien. It might help to know that all these files already have another copy in the library, so deleting them will not affect archived content. Please proceed with deleting them.

benoit74 commented 7 months ago

It might help to know that all these files have already another copy in the library.

This statement is false, at least allthetropes_en_all_maxi_2020-10.zim has no other copy in the library (maybe it is the single exception).

And I don't get what makes you believe that another file has a better Flavor metadata?

I had a look and it seems that you are requesting to delete the most recent file for every recipe.

I doubt that an older file will have a better Flavor metadata.

I checked appropedia_en_all_maxi_2021-01.zim (the one that will stay in the library after your cleanup request) and can confirm this older copy has the same flaw in its Flavour value:

>>> zim = Archive("appropedia_en_all_maxi_2021-01.zim")
>>> zim.metadata
{
    "Counter": "text/plain=10;text/css=40;application/javascript=38;image/png=1;text/html=10150;image/webp=22548;application/pdf=74;image/gif=609;text/html; charset=UTF-8=37;undefined=3;text/html; charset=ISO-8859-1=1;text/html; charset=utf-8=3;text/html;charset=UTF-8=1",
    "Creator": "Appropedia",
    "Date": "2021-01-28",
    "Description": "Sharing knowledge to build rich & sustainable lives",
    "Flavour": "_maxi",
    "Language": "eng",
    "Name": "appropedia_en_all",
    "Publisher": "Kiwix",
    "Scraper": "mwoffliner 1.11.3",
    "Tags": "appropedia;_pictures:yes;_videos:no;_details:yes;_ftindex:yes",
    "Title": "Appropedia"
}
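
The same check could be scripted across many files. The following is only a sketch under a stated assumption: the thread never says what exactly is wrong with the value, so it assumes the flavour segment of the file name is the expected value and flags any `Flavour` metadata that differs from it (for example `_maxi` with a leading underscore, as in the dump above). The helper names are hypothetical.

```python
import re

# Assumption: file names in this thread follow <name>_<flavour>_<YYYY-MM>.zim,
# e.g. appropedia_en_all_maxi_2021-01.zim. Only the flavours seen in the
# thread (maxi, mini, nopic) are recognised.
FILENAME_RE = re.compile(
    r"^(?P<name>.+)_(?P<flavour>maxi|mini|nopic)_(?P<period>\d{4}-\d{2})\.zim$"
)


def flavour_from_filename(filename: str) -> str:
    """Return the flavour segment encoded in a ZIM file name."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected ZIM file name: {filename}")
    return match.group("flavour")


def flavour_matches(filename: str, metadata: dict) -> bool:
    """True when the Flavour metadata equals the flavour in the file name."""
    return metadata.get("Flavour") == flavour_from_filename(filename)


# Value taken from the metadata dump above.
metadata = {"Name": "appropedia_en_all", "Flavour": "_maxi"}
print(flavour_matches("appropedia_en_all_maxi_2021-01.zim", metadata))
```

Under that assumption, the 2021-01 file would be flagged just like the 2021-03 one.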

benoit74 commented 6 months ago

Should we discuss this live?

RavanJAltaie commented 5 months ago

@benoit74 As per our last discussion, I've reviewed the links of the files proposed for deletion. All files have another copy except these:

https://download.kiwix.org/zim/other/allthetropes_en_all_maxi_2020-10.zim
https://download.kiwix.org/zim/other/installgentoo_en_all_nopic_2019-09.zim
https://download.kiwix.org/zim/other/installgentoo_en_all_nopic_2019-09.zim

You can keep them for archiving purposes and delete the rest of the list, please.
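
A review like this one ("does each file have another copy?") could also be done mechanically. The sketch below assumes a plain list of file names currently in the library is available as input; how that list is obtained is not covered in this thread, and the function names are hypothetical.

```python
from collections import defaultdict


def split_filename(filename: str) -> tuple[str, str]:
    """Split 'appropedia_en_all_maxi_2021-03.zim' into
    ('appropedia_en_all_maxi', '2021-03')."""
    stem = filename[:-4] if filename.endswith(".zim") else filename
    prefix, _, period = stem.rpartition("_")
    return prefix, period


def without_other_copy(to_delete: list[str], library: list[str]) -> list[str]:
    """Return the files from to_delete that have no other period in library."""
    periods = defaultdict(set)
    for filename in library:
        prefix, period = split_filename(filename)
        periods[prefix].add(period)

    lonely = []
    for filename in to_delete:
        prefix, period = split_filename(filename)
        # Another copy exists only if some *different* period is present.
        if not (periods[prefix] - {period}):
            lonely.append(filename)
    return lonely


# Hypothetical library listing, limited to files named in this thread.
library = [
    "appropedia_en_all_maxi_2021-01.zim",
    "appropedia_en_all_maxi_2021-03.zim",
    "allthetropes_en_all_maxi_2020-10.zim",
]
to_delete = [
    "appropedia_en_all_maxi_2021-03.zim",
    "allthetropes_en_all_maxi_2020-10.zim",
]
print(without_other_copy(to_delete, library))
```

Note that this only answers whether another copy exists, not whether that copy has a correct flavour.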

benoit74 commented 5 months ago

I'm glad you confirm that we have a previous version for most ZIMs, but I still have the same question: are you sure that the previous versions of these ZIMs have a correct flavor? I suspect they don't have a more appropriate one. Did you check that as well?

Otherwise I think there are only two ways forward:

RavanJAltaie commented 3 months ago

@benoit74 Just to be on the same page: at an earlier stage of this fix project, I already manually fixed all the wrong flavours in the zimfarm recipes. The second stage is to remove the affected files. The files mentioned in this issue are the affected ones and need to be deleted. The older versions of the files are OK (I checked), as the error causing this problem was temporary and has been fixed.

benoit74 commented 3 months ago

@RavanJAltaie I've deleted the first two files (appropedia nopic and maxi) to check if everything is fine. Version 2021-03 is now gone, and replaced by version 2021-01. Library is now updated to use this 2021-01 version.

Please have a look at https://library.kiwix.org/viewer#appropedia_en_all_maxi_2021-01 and https://library.kiwix.org/viewer#appropedia_en_all_nopic_2021-01

As far as I can tell these older versions are no better than the 2021-03 version I've just deleted. Flavour is still wrong on both ZIMs: https://library.kiwix.org/raw/appropedia_en_all_maxi_2021-01/meta/Flavour and https://library.kiwix.org/raw/appropedia_en_all_nopic_2021-01/meta/Flavour
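
For anyone wanting to repeat this check on other books, a minimal sketch follows. It relies only on the URL pattern visible in the two links above (`/raw/<book>/meta/Flavour`). The function names are hypothetical, and the response is assumed to be plain UTF-8 text.

```python
from urllib.request import urlopen

LIBRARY = "https://library.kiwix.org"


def flavour_url(book: str) -> str:
    """Build the raw metadata URL for a book, following the links above."""
    return f"{LIBRARY}/raw/{book}/meta/Flavour"


def fetch_flavour(book: str, timeout: float = 10.0) -> str:
    """Download the Flavour metadata value (assumed to be UTF-8 text)."""
    with urlopen(flavour_url(book), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    for book in (
        "appropedia_en_all_maxi_2021-01",
        "appropedia_en_all_nopic_2021-01",
    ):
        print(book, repr(fetch_flavour(book)))
```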

I'm sorry, but there is something I really don't get here. Why are you saying "The older versions of the files are ok, I checked, as the error causing this problem was temporary and fixed."? Could you explain to me what is better in the 2021-01 version of the appropedia ZIMs compared to the 2021-03 version?

RavanJAltaie commented 3 months ago

@benoit74 I've double checked the related recipes, all of them are already disabled except this one: wikisource_zh-min-nan is active and succeeding. So I believe you can start the process of creating the files correctly (manually) as discussed, then we can delete the wrong files from the library.

benoit74 commented 3 months ago

OK, thank you. As discussed, we do not need to delete the wrong files; they will be replaced. Closing this issue since there is nothing left to do here; the remaining work is tracked in https://github.com/openzim/zim-requests/issues/1089