stactools-packages / modis

stactools package for working with MODIS data

Item IDs shouldn't include the production date. #87

Closed TomAugspurger closed 2 years ago

TomAugspurger commented 2 years ago

Currently, the IDs of items generated by this package include a production date (https://lpdaac.usgs.gov/data/get-started-data/collection-overview/missions/modis-overview/#modis-naming-conventions). In the item ID MCD15A2H.A2021265.h00v08.061.2021320165929, the trailing 2021320165929 is the production date in Julian form (year, day of year, then time of day: YYYYDDDHHMMSS).
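To make the ID layout concrete, here is a minimal sketch that splits the example ID into its dot-separated parts and decodes the two Julian dates. The names `ModisId` and `parse_item_id` are made up for illustration and are not part of this package.

```python
from datetime import datetime
from typing import NamedTuple


class ModisId(NamedTuple):
    """Hypothetical container for the parts of an old-style item ID."""

    product: str        # product short name, e.g. MCD15A2H
    acquired: datetime  # acquisition date, from "A" + YYYYDDD
    tile: str           # tile ID, e.g. h00v08
    collection: str     # collection version, e.g. 061
    produced: datetime  # production date, from YYYYDDDHHMMSS


def parse_item_id(item_id: str) -> ModisId:
    """Split an ID of the form product.AYYYYDDD.tile.collection.production."""
    product, acquired, tile, collection, produced = item_id.split(".")
    return ModisId(
        product=product,
        acquired=datetime.strptime(acquired, "A%Y%j"),  # %j is day of year
        tile=tile,
        collection=collection,
        produced=datetime.strptime(produced, "%Y%j%H%M%S"),
    )


parsed = parse_item_id("MCD15A2H.A2021265.h00v08.061.2021320165929")
print(parsed.acquired.date(), parsed.produced)
# prints: 2021-09-22 2021-11-16 16:59:29
```

So the example granule was acquired on day 265 of 2021 and produced almost two months later; only the last component changes when the provider reprocesses it.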

Occasionally, the upstream data provider will reprocess assets. AFAICT, the original assets are deleted and replaced with the new assets. The new assets have the same actual datetime / date range and the same tile IDs. But because of how the item IDs are derived, the item created for the new assets will have a different ID than the old ones.

While it isn't 100% clear to me what the right thing to do is, I think that (by default) the item IDs shouldn't include the production date. In this case, the upstream provider is (I think) deleting the old assets and replacing them with the new ones, so presumably they think the new assets should replace the old ones. And so I think the new item should replace the old one too.
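A rough sketch of what "no production date in the ID" could look like, assuming the production timestamp is always the last dot-separated component and is 13 digits long. `stable_item_id` is a hypothetical helper, not necessarily how this package does (or will do) it, and the second ID below is invented to stand in for a reprocessed granule.

```python
def stable_item_id(item_id: str) -> str:
    """Drop the trailing production timestamp from an old-style item ID."""
    stable, _, production = item_id.rpartition(".")
    # Guard against IDs that do not end in a YYYYDDDHHMMSS timestamp.
    if not stable or len(production) != 13 or not production.isdigit():
        raise ValueError(f"no production timestamp found in {item_id!r}")
    return stable


original = "MCD15A2H.A2021265.h00v08.061.2021320165929"
# Hypothetical ID for the same granule after upstream reprocessing.
reprocessed = "MCD15A2H.A2021265.h00v08.061.2022045120000"

print(stable_item_id(original))
print(stable_item_id(original) == stable_item_id(reprocessed))
```

Both IDs collapse to the same value, so the item built from the reprocessed assets would overwrite the old item instead of sitting next to it.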

Here are a couple of examples:

TomAugspurger commented 2 years ago

Here's our tentative plan for the Planetary Computer:

  1. We'll adopt the new Item ID scheme (no processing date) going forward. Any items with a "new" item ID that are reprocessed will be updated in place. The "old" assets, which are no longer accessible from upstream, will either be orphaned (no STAC item pointing to them) or deleted.
  2. We'll delete items with the old Item ID that have been reprocessed. This will break any links pointing directly to these items. We've judged that ensuring uniqueness for any given (acquisition datetime, tile_id) is more valuable.
  3. We'll keep the old Item ID scheme for existing items that are the "newest" for that (acquisition datetime, tile_id) (i.e. they have never been reprocessed, or they are the reprocessed item). That way we don't break every workflow linking to a particular item in the Planetary Computer.
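Steps 2 and 3 amount to grouping the existing old-style IDs by everything except the production date and keeping only the newest in each group. A minimal sketch of that selection, assuming plain ID strings as input; `plan_cleanup` is a hypothetical name, and the second and third IDs in the usage example are invented.

```python
from collections import defaultdict


def plan_cleanup(old_item_ids):
    """Return (keep, delete) lists of old-style item IDs."""
    groups = defaultdict(list)
    for item_id in old_item_ids:
        # Everything before the last "." identifies the granule;
        # the last component is the production timestamp.
        key, _, production = item_id.rpartition(".")
        groups[key].append((production, item_id))

    keep, delete = [], []
    for versions in groups.values():
        # Fixed-width YYYYDDDHHMMSS strings sort chronologically as text.
        versions.sort()
        *stale, newest = versions
        keep.append(newest[1])
        delete.extend(item_id for _, item_id in stale)
    return keep, delete


keep, delete = plan_cleanup(
    [
        "MCD15A2H.A2021265.h00v08.061.2021320165929",
        "MCD15A2H.A2021265.h00v08.061.2022045120000",  # hypothetical reprocessing
        "MCD15A2H.A2021265.h01v08.061.2021320165929",  # hypothetical other tile
    ]
)
print(keep)
print(delete)
```

Note that the grouping key here also includes the product name and collection version, which is slightly stricter than (acquisition datetime, tile_id) alone.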

So we'll be able to say that

  1. All items after date "X" have the new item ID scheme, while those older than date "X" have the old item ID scheme.
  2. We don't have any duplicates.

It's worth mentioning that we'd ideally use the version extension. When an asset is reprocessed by the upstream provider, we would update the STAC item with the new assets and add a version link to the item linking to the old assets. But implementing that is relatively complicated, so for now we will silently update the items when the assets are reprocessed.
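For reference, a rough sketch of what the version-extension approach might involve, using plain dictionaries rather than any STAC library. Everything here is an assumption for illustration: `apply_reprocessing` is a hypothetical helper, the schema URL pins one published release of the extension, the production timestamp is used as the `version` value, and `archived_href` stands for wherever a copy of the superseded item would be kept.

```python
import copy

# Assumed release of the version extension schema.
VERSION_EXT = "https://stac-extensions.github.io/version/v1.0.0/schema.json"


def apply_reprocessing(item, new_assets, new_version, archived_href):
    """Return an updated copy of a STAC item dict after upstream reprocessing."""
    updated = copy.deepcopy(item)  # leave the caller's item untouched
    updated["assets"] = new_assets
    updated.setdefault("properties", {})["version"] = new_version

    extensions = updated.setdefault("stac_extensions", [])
    if VERSION_EXT not in extensions:
        extensions.append(VERSION_EXT)

    # Point back at the archived copy of the superseded item.
    updated.setdefault("links", []).append(
        {
            "rel": "predecessor-version",
            "href": archived_href,
            "type": "application/json",
        }
    )
    return updated


old_item = {
    "id": "MCD15A2H.A2021265.h00v08.061",
    "properties": {"version": "2021320165929"},
    "assets": {"hdf": {"href": "https://example.com/old.hdf"}},
    "links": [],
}
new_item = apply_reprocessing(
    old_item,
    new_assets={"hdf": {"href": "https://example.com/new.hdf"}},
    new_version="2022045120000",  # hypothetical reprocessing timestamp
    archived_href="https://example.com/archive/old-item.json",
)
print(new_item["links"])
```

The complicated part is not this update itself but everything around it: the old item and its assets have to be kept somewhere stable, which is not possible if the upstream assets have already been deleted.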