nsidc / granule-metgen

Metadata generator for direct-to-Cumulus era

Zero-padding rules for collection version #28

Open afitzgerrell opened 1 month ago

afitzgerrell commented 1 month ago

DStew has been asked to come up with a version numbering policy.

Current rules from Daniel:

  1. The shortname and version must match exactly in all five places they are present: the CNM ingest message, the granule-level UMM-G file associated with an input granule, the collection-level ISO record exported from the CMR-Mediator from the EDB, the collection configuration in Cumulus, and the Rule configuration in Cumulus. Since both the shortname and version are strings (not numbers), they must match exactly. There is no concept of truncating zero padding; the zeros are just characters in the version.
  2. By default, the CMR mediator exports collection-level metadata versions without zero padding. Adding zero padding for SMAP and ICESat-2 collections is a manual "extra" configuration that was done to meet this requirement in ECS, because the data producers for these collections added zero padding to their version numbers (a significant amount of development time was put into handling this case during SMAP development).
  3. As far as I'm aware, the CMR mediator currently cannot selectively use zero padding for the same collection across different providers (i.e. you can't have NSIDC-0630 version 001 in Cumulus and NSIDC-0630 version 1 in ECS).
  4. Currently, Cumulus uses the exact same zero padding as ECS.
  5. To change the zero padding of a granule would currently require reingesting that granule from scratch into a new collection in Cumulus (to Cumulus, collections with versions with different zero padding are effectively different collections, and there's no way to move granules from one collection to another).
  6. The native id (e.g., MODIS/Aqua Snow Cover Daily L3 Global 0.05Deg CMG V061) for the collection in CMR has no impact on this choice, but the native id used by the CMR Mediator to export the collection to CMR must match what is used in the Cumulus collection configuration exactly. I believe that by default this does have zero padding on the version.
  7. The dataset title has no impact on this choice and isn't used anywhere. However, the native id used by the CMR Mediator is the same as the title.
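
Rule 1 can be illustrated with a small check. This is only a sketch: the source labels and the `mismatched_sources` helper are hypothetical names made up for illustration, and how each (shortname, version) pair is actually read from the CNM message, UMM-G file, ISO record, and Cumulus configuration is not shown.

```python
# Hypothetical labels for the five places named in rule 1.
SOURCES = ("cnm_message", "umm_g", "iso_record", "cumulus_collection", "cumulus_rule")


def mismatched_sources(identifiers, reference="cumulus_collection"):
    """Return the sources whose (short_name, version) pair differs from the reference.

    Comparison is plain string equality, so "1" and "001" are different versions.
    """
    expected = identifiers[reference]
    return sorted(name for name, pair in identifiers.items() if pair != expected)


# Example: the ISO record was exported with zero padding, the rest were not.
found = {
    "cnm_message": ("NSIDC-0630", "1"),
    "umm_g": ("NSIDC-0630", "1"),
    "iso_record": ("NSIDC-0630", "001"),
    "cumulus_collection": ("NSIDC-0630", "1"),
    "cumulus_rule": ("NSIDC-0630", "1"),
}
print(mismatched_sources(found))  # ['iso_record']
```

No numeric conversion is attempted on purpose: converting to `int` before comparing would hide exactly the mismatch that rule 1 warns about.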

Additionally, determine the authoritative source for the correct representation of the version. Note, Daniel advised: "Better is to use CMR"

    curl 'https://cmr.earthdata.nasa.gov/search/collections.json?provider=NSIDC_ECS&short_name=SPL3SMP&version=009' | python -m json.tool
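
If CMR is used as the authoritative source, the same lookup could be scripted. The sketch below uses only the Python standard library. It assumes the JSON response lists collections under `feed` / `entry` and carries the version in a `version_id` field; that layout is my understanding of the CMR JSON format and should be checked against a real response. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_query_url(provider, short_name):
    """Build the collection search URL without a version filter."""
    params = urllib.parse.urlencode({"provider": provider, "short_name": short_name})
    return f"{CMR_COLLECTIONS_URL}?{params}"


def versions_from_response(payload):
    """Return version strings exactly as CMR reports them (no padding changes).

    Assumes entries live under payload["feed"]["entry"] with a "version_id" key.
    """
    entries = payload.get("feed", {}).get("entry", [])
    return [entry["version_id"] for entry in entries if "version_id" in entry]


def fetch_versions(provider, short_name):
    """Query CMR over the network and return the reported version strings."""
    with urllib.request.urlopen(build_query_url(provider, short_name)) as response:
        return versions_from_response(json.load(response))
```

Calling `fetch_versions("NSIDC_ECS", "SPL3SMP")` would then list every version string CMR holds for that shortname, padded or not, which is the value the other four places have to match.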

afitzgerrell commented 1 month ago

DStew met yesterday (31 July 2024) and decided that there's no compelling reason* to change our versioning habits for data sets (that aren't ICESat-2 or SMAP), and we'll stay the course of single-character version representation on our end (i.e., in the EDB).

*Reasons to stick with status quo:

- Machine readable is n/a since version is contained within a string element in cnm/ummg
- Everything ingested to ECS is non padded (except ICESat-2 and SMAP; opted to let these two be the anomalies and not change every other data set we've archived)
- Standards evolution nebulous and not critical / not worth affecting ECS decommissioning by accommodating making a change to the way we do versioning presently
- Future versions could adopt new standards. EDSC etc. would have to adapt per ESDIS decision
- DPT can accommodate this request easily (they just need to know by end of Sept)
- This will not affect file names and will not affect title

Kara is going to make one last verification with Daniel that this won't end up negatively affecting bent-pipe ingest at some point down the road. Once I hear back from her, I'll mark this issue as complete.

afitzgerrell commented 1 month ago

DStew heard back from Kara and finalized sticking with non-zero-padded version identifiers (e.g., "6") for all collections except IS-2 and SMAP. For data sets within those two missions, the data catalog services (DCS) will continue to pad the version (e.g., "006").
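
As a rough illustration of the final policy, a helper along these lines would produce the version string for a collection. It is a sketch only: the `version_identifier` name, the mission labels, and the assumption that padding is always three characters wide (taken from the "006" and "009" examples in this thread) are mine, not something DCS or metgen is known to implement.

```python
# Assumption: the two padded missions are identified by these labels.
PADDED_MISSIONS = {"ICESat-2", "SMAP"}


def version_identifier(version, mission=None, width=3):
    """Return the version string under the policy described above.

    ICESat-2 and SMAP versions are zero-padded to `width` characters;
    every other collection keeps its version unpadded.
    """
    text = str(version)
    if mission in PADDED_MISSIONS:
        return text.zfill(width)
    return text


print(version_identifier(6))          # 6
print(version_identifier(6, "SMAP"))  # 006
```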