openzim / overview

:balloon: Start here for current projects, how to get involved, and joining community calls. A resource for new and veteran members of the offline community.

Add new standard metadata (optional) to differentiate content harvesting period from ZIM creation date #40

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Currently, ZIM metadata has only one standard date-related entry, named Date. The documentation specifically states that this is the ZIM creation date.

There is no standard metadata to store information about when the ZIM content has been captured / fetched / crawled / scraped / ...

Since we regularly rebuild ZIMs (see ZIM Update v2 at https://github.com/openzim/overview/issues/35 and https://wiki.openzim.org/wiki/ZIM_Updates) and we increasingly process content that was harvested at a different time than the ZIM creation (all stackexchange, some zimit with warcs reprocessed), it is useful to consider adding a new standard metadata to store this information.

Since content (e.g. with zimit) can be scraped across multiple days, it seems important that the date is in fact a from-to range.

Just like the current Date metadata, I think we should keep this metadata understandable and easy to read by storing only a day, not a day plus a time.

Since some content might come with coarser precision than a day (e.g. when a content provider says "this is the content for April 2023, do not mind which day I published it"), I think we need to allow passing only a month or only a year in this metadata.
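To make the three constraints above concrete (a from-to range, day granularity at most, month-only or year-only values allowed), here is a minimal Python sketch of how a reader could interpret such a value. The encoding is purely an assumption for illustration: it takes YYYY, YYYY-MM or YYYY-MM-DD values and borrows the "/" separator from ISO 8601 intervals for the range. The function names parse_partial and parse_period are hypothetical and not part of any openZIM library.

```python
import calendar
import re
from datetime import date

# Assumed encoding (illustration only): YYYY, YYYY-MM or YYYY-MM-DD,
# never a time of day.
_PARTIAL = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def parse_partial(value: str) -> tuple[date, date]:
    """Return the first and last day covered by a year, month or day value."""
    match = _PARTIAL.fullmatch(value)
    if match is None:
        raise ValueError(f"expected YYYY, YYYY-MM or YYYY-MM-DD, got {value!r}")
    year, month, day = (int(g) if g else None for g in match.groups())
    if day is not None:
        exact = date(year, month, day)
        return exact, exact
    if month is not None:
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    return date(year, 1, 1), date(year, 12, 31)


def parse_period(value: str) -> tuple[date, date]:
    """Parse a single value or a hypothetical 'start/end' range into two days."""
    first, sep, second = value.partition("/")
    start = parse_partial(first)[0]
    # Without a separator, the single value describes the whole period.
    end = parse_partial(second if sep else first)[1]
    if start > end:
        raise ValueError(f"period starts after it ends: {value!r}")
    return start, end
```

With this reading, "2023-04" covers the whole of April 2023, while "2023-04-28/2023-05-02" describes a crawl spread over five days. Values carrying a time component are rejected, which matches the day-only goal.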

I hence propose to introduce this new standard ZIM metadata:

WDYT?

mgautierfr commented 1 month ago

Related to https://github.com/openzim/overview/issues/9