kiwix / overview

https://kiwix.org

ZIM Deletion Policy #103

Open rgaudin opened 4 months ago

rgaudin commented 4 months ago

As agreed during the Hackathon, here's a request to the Content Team for a ZIM Deletion Policy. We want a Wiki entry that lists all the possible reasons for deleting a ZIM. Requests for deletion will then need to provide said reason.

We've discussed that one of the reasons could be metadata no longer aligned with our Q/A standard. We ultimately want to allow the content team to fix that themselves (@benoit74 to open a ticket on zimfarm). In the meantime, individual delete requests can exceptionally be fixed by developers using zimrecreate.

benoit74 commented 4 months ago

Here is a proposed deletion policy. Please provide feedback quickly: I would like to enforce it by the end of May 2024 at the latest. WDYT?

Deletion policy

On an exceptional basis, it is possible to delete a ZIM which has been published on library.kiwix.org; we need to ensure this remains exceptional. As a publisher we somehow promise users to make our best effort so they keep access to the content we've published once. And as a publisher we need to enforce Q/A so that published ZIMs are known to be OK before publication.

Scope details:

The only acceptable reasons to delete a ZIM are:

Except for the last one, none of these reasons is expected to occur on a regular basis, and some have never occurred in the past, so we expect they will continue to lead to a very low level of ZIM deletions.

The following reasons are not acceptable:

benoit74 commented 4 months ago

@RavanJAltaie FYI, suggestions are welcome

benoit74 commented 4 months ago

Zimfarm issue about metadata update is here: https://github.com/openzim/zimfarm/issues/956

rgaudin commented 4 months ago

LGTM

@Popolechien do you remember the wikipedia_en_all_maxi ZIM that had one article defaced with racist content at the time of scrape? 2023-10-07

I think this would match “ZIM content is now known to be wrong” but we'd still have to discuss case-by-case whether it's worth deleting (as we know vandalized articles are most likely included in every ZIM).

Popolechien commented 4 months ago

Yeah I don't think this particular case fits in the reasons listed, but then this seems fairly common sense. Maybe add something along the lines of "Zim content deviates significantly from educational mission". There's another zim that has recently been flagged as moving away from prepper content/themes to simple product placement: I still need to look into it but to me that would also warrant removal.

Other than that, I would remove this sentence from the intro:

> As a publisher we somehow promise users to make our best effort so they keep access to the content we've published once.

Not sure about the somehow (sounds weird to me), but more broadly that makes us an archival project, which I don't really agree with (plus the fact that we don't make any effort to ensure compatibility with older zim files; nor can we afford to, as a matter of fact).

benoit74 commented 4 months ago

> Yeah I don't think this particular case fits in the reasons listed, but then this seems fairly common sense

I don't agree: a policy is meant to avoid relying on common sense, since it is clear that this is too much a matter of interpretation.

I would add a reason like "ZIM contains vandalized / defaced content on important pages". I'm a bit afraid this is still a bit too subjective, but the past showed us that we made the decision to delete the ZIM for one single vandalized page, so it seems this is the path we want to follow.

> Maybe add something along the lines of "Zim content deviates significantly from educational mission".

I would make it even broader with "ZIM content does not match acceptable content policy (educational mission, ...)"

> Not sure about the somehow (sounds weird to me), but more broadly that makes us an archival project, which I don't really agree with

I don't mind removing the "somehow". But I still don't think this phrase makes us an archival project, and I consider it very important. Most content providers have the same kind of core promise.

For instance, StackExchange gets contributions because they promise users will continue to get access to the published content for "the time being". StackExchange has a strong policy on which questions might get deleted at https://meta.stackexchange.com/help/deleted-questions (and they do delete a lot AFAIK). Without both, I'm quite sure the project would fade out quickly.

If we remove this sentence, then I don't get why we would really need a deletion policy, or what could help us decide what is acceptable or not in this policy. I would consider that we might delete any ZIM which no longer suits any of us, whatever the reason, since it is clearly the path of least effort and our available bandwidth is very limited anyway.

To help me better understand, I would probably benefit from another "core promise" which explains why the deletions I've listed as not acceptable are indeed not acceptable. Otherwise it looks to me like this will always be a topic of debate.

That being said, if at least we are all aligned today on the acceptable reasons, I don't mind if we remove the phrase, should a majority not be OK with it (I don't like consensus ^^)

rgaudin commented 4 months ago

> but the past showed us that we made the decision to delete the ZIM for one single vandalized page, so it seems this is the path we want to follow.

Very important clarification: we did not remove that content from the Catalog. We removed one ZIM file because we keep two specifically for such reasons. If the latest one out of the Zimfarm has an issue, we can delete it and continue to serve the content (we only serve one version of a Title at once). Also, that content is being refreshed periodically (but recreating is fragile and takes time).
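To illustrate the distinction between deleting one ZIM and removing content from the Catalog, here is a minimal sketch of the two-version retention described above. It is not Zimfarm or library code: the `Title` class, its method names, the `keep` field, and the period strings are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Title:
    """A catalog entry and the ZIM files retained for it (hypothetical model)."""

    name: str
    # Periods of the retained ZIMs, oldest first (format assumed: "YYYY-MM").
    versions: List[str] = field(default_factory=list)
    # Number of ZIMs retained per Title; the thread mentions two.
    keep: int = 2

    def publish(self, period: str) -> None:
        """Add a new ZIM and drop the oldest ones beyond the retention count."""
        self.versions.append(period)
        self.versions.sort()
        del self.versions[:-self.keep]

    def served(self) -> Optional[str]:
        """Only one version of a Title is served at once: the most recent."""
        return self.versions[-1] if self.versions else None

    def delete(self, period: str) -> None:
        """Delete a single ZIM; the Title keeps being served if another remains."""
        self.versions.remove(period)


# Hypothetical periods, for illustration only.
title = Title("wikipedia_en_all_maxi")
for period in ("2023-08", "2023-09", "2023-10"):
    title.publish(period)

title.delete("2023-10")  # the latest ZIM has an issue
print(title.served())    # prints "2023-09": the previous ZIM is served instead
```

Under these assumptions, the content only disappears from the Catalog once the last retained ZIM of a Title is deleted, which is the case a content-level policy would need to cover.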

I think in my mind the policy was for removing content and not individual ZIMs when there's another one, but it's probably the place to clarify both situations

benoit74 commented 4 months ago

> I think in my mind the policy was for removing content and not individual ZIMs when there's another one, but it's probably the place to clarify both situations

It is named "ZIM deletion policy", so I thought we wanted to deal with individual ZIMs. This is intentional on my side, and the reason why I clearly mentioned these "two more recent versions". It is also probably the right granularity for such a policy, since deletion requests are usually made at the ZIM level (not content) anyway.

Popolechien commented 4 months ago

We will never be able to cover every possible way things can go wrong, unless the policy goes into so much detail that it becomes irrelevant. There will always be some level of arbitrary decision.

For the case referred to, of a specific Zim file with problematic content, the informal policy we had with @RavanJAltaie is "Do people complain, which means that people notice?". That allows us to identify high-traffic, high-visibility zim files/pages that need immediate action (whereas low-traffic ones can be automatically handled by the next scraper iteration).

A choice has to be made between "Delete old zim files, with exceptions" and "Do not delete zim files, with exceptions". Finding a wording that intersects both would be ideal.