Add "multi-scene" collecting and publishing

pnuu commented 1 year ago

To create multi-temporal datasets, data need to be collected and published for multiple time slots.

As an example, https://github.com/pytroll/satpy/pull/2488 needs three distinct datasets:

files for a dataset at T-2
files for a dataset at T-1
files for the latest available dataset

The time shift between the datasets can be anything, for example 15/30/60 minutes. It can even be irregular, for instance when used for polar satellite data or when emphasis is needed in one direction or the other.

There are other envisioned needs for this kind of collection/publishing, so the feature needs to be kept as flexible as possible.

Messages

Currently we have the following message types for publishing data:

file: plain JSON without nested lists or dictionaries, everything at the "top level" of the message
- used for individual files
dataset: combined metadata (start/end times, platform, and such) at the top level, and a list named dataset of dictionaries having URI and UID of individual files
- used for geostationary segments
collection: same as above, but there is a list named collection with dictionaries of individual start/end times and datasets
- used for multi-segment multi-time data, such as granulated VIIRS SDR swaths
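
To make the three layouts above concrete, here is a rough sketch of the message payloads as Python dictionaries. Only the `uri`/`uid` keys and the `dataset` and `collection` list names come from the description above; the paths, platform names, times and the `payload_kind` helper are made up for illustration and are not taken from posttroll.

```python
# Illustrative payloads only. Paths, platform names and times are invented.
file_msg = {
    "uri": "/data/msg/H-000-MSG4-000001",
    "uid": "H-000-MSG4-000001",
    "start_time": "2023-06-01T12:00:00",
    "platform_name": "Meteosat-11",
}

dataset_msg = {
    # Combined metadata at the top level
    "start_time": "2023-06-01T12:00:00",
    "end_time": "2023-06-01T12:15:00",
    "platform_name": "Meteosat-11",
    # One entry per file (e.g. per geostationary segment)
    "dataset": [
        {"uri": "/data/msg/H-000-MSG4-000001", "uid": "H-000-MSG4-000001"},
        {"uri": "/data/msg/H-000-MSG4-000002", "uid": "H-000-MSG4-000002"},
    ],
}

collection_msg = {
    "platform_name": "NOAA-20",
    # One entry per granule, each with its own times and dataset
    "collection": [
        {
            "start_time": "2023-06-01T12:00:00",
            "end_time": "2023-06-01T12:01:25",
            "dataset": [{"uri": "/data/viirs/SVM01_g1.h5", "uid": "SVM01_g1.h5"}],
        },
        {
            "start_time": "2023-06-01T12:01:25",
            "end_time": "2023-06-01T12:02:50",
            "dataset": [{"uri": "/data/viirs/SVM01_g2.h5", "uid": "SVM01_g2.h5"}],
        },
    ],
}


def payload_kind(data):
    """Guess the message type from the payload layout (hypothetical helper)."""
    if "collection" in data:
        return "collection"
    if "dataset" in data:
        return "dataset"
    return "file"
```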

The collection message type could be used for the collection of multi-temporal data that is described here, but how would it be distinguished from the existing usage? Should there be a new message type like library (file -> dataset -> collection -> library 😜) or something that has a list named library with collections with datasets inside?
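
As a sketch of the nested option, the following wraps several collection payloads into one payload with a top-level list. The key name `library` is only the tentative name floated above, and `build_library` and `make_collection` are hypothetical helpers, not existing functions in any Pytroll package.

```python
from datetime import datetime


def make_collection(start, end, uri):
    """Build a minimal collection payload with a single file (illustration only)."""
    uid = uri.rsplit("/", 1)[-1]
    return {
        "collection": [
            {"start_time": start, "end_time": end, "dataset": [{"uri": uri, "uid": uid}]},
        ],
    }


def build_library(collections):
    """Wrap collection payloads into one hypothetical 'library' payload."""
    parts = [part for coll in collections for part in coll["collection"]]
    return {
        # Overall time coverage of all included collections
        "start_time": min(part["start_time"] for part in parts),
        "end_time": max(part["end_time"] for part in parts),
        # A distinct top-level key lets consumers tell this apart from "collection"
        "library": list(collections),
    }


older = make_collection(datetime(2023, 6, 1, 11, 0), datetime(2023, 6, 1, 11, 15), "/data/t-1.nc")
latest = make_collection(datetime(2023, 6, 1, 12, 0), datetime(2023, 6, 1, 12, 15), "/data/t0.nc")
library = build_library([older, latest])
```

Because the top-level key differs, a consumer that only understands the existing collection layout would not mistake the new payload for a single multi-granule swath.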

Configuration

This is the first crude idea of how to configure which data are published together. The publishing would be triggered after each data collection has terminated.

published_slots:
  - {min_age: 0, max_age: 0}
  - {min_age: 60, max_age: 65}
  - {min_age: 120, max_age: 125}

The min/max ages are relative to the start time of the currently completed collection. Just having the 0/0 combination would equal the current behaviour of publishing the latest completed set. Not all of the criteria will always be met: just after a restart, for example, the earlier slots might not have been collected.
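
A minimal sketch of how the age windows could be matched against completed slots. It assumes the ages are in minutes, and it simply returns `None` when a window has no match; what should actually happen in that case is not decided here. `select_slots` is a hypothetical name.

```python
from datetime import datetime, timedelta

# Same values as in the configuration idea above; minutes are an assumption.
PUBLISHED_SLOTS = [
    {"min_age": 0, "max_age": 0},
    {"min_age": 60, "max_age": 65},
    {"min_age": 120, "max_age": 125},
]


def select_slots(latest_start, completed_starts, published_slots=PUBLISHED_SLOTS):
    """Return one completed start time per age window, or None if a window is empty."""
    selected = []
    for window in published_slots:
        min_age = timedelta(minutes=window["min_age"])
        max_age = timedelta(minutes=window["max_age"])
        matches = [
            start for start in completed_starts
            if min_age <= latest_start - start <= max_age
        ]
        if not matches:
            # E.g. just after a restart, when earlier slots are missing
            return None
        # If several slots fit the window, take the most recent one
        selected.append(max(matches))
    return selected


latest = datetime(2023, 6, 1, 12, 0)
completed = [datetime(2023, 6, 1, 9, 57), datetime(2023, 6, 1, 10, 58), latest]
to_publish = select_slots(latest, completed)
after_restart = select_slots(latest, [latest])
```

With only the 0/0 window configured, the function returns just the latest slot, which matches the current behaviour.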

Internals

Currently the completed Slots are deleted. We need to add a new check that looks at the published_slots config (and timeliness?) to determine which slots are not needed anymore. As the keys in the self.slots dictionary are the nominal or start time (possibly rounded, depending on config) of the slot as a string, comparison is quite easy.
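
The clean-up check could look roughly like the sketch below. It assumes the slot keys are `str(datetime)` values without microseconds and that ages are in minutes; `obsolete_slot_keys` is a hypothetical helper, and timeliness is not considered.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of the slot keys


def obsolete_slot_keys(slot_keys, latest_start, published_slots):
    """Return the keys of slots that are older than the largest configured max_age."""
    largest_age = max(window["max_age"] for window in published_slots)
    oldest_needed = latest_start - timedelta(minutes=largest_age)
    return [
        key for key in slot_keys
        if datetime.strptime(key, TIME_FORMAT) < oldest_needed
    ]


windows = [
    {"min_age": 0, "max_age": 0},
    {"min_age": 60, "max_age": 65},
    {"min_age": 120, "max_age": 125},
]
keys = ["2023-06-01 09:00:00", "2023-06-01 10:00:00", "2023-06-01 12:00:00"]
removable = obsolete_slot_keys(keys, datetime(2023, 6, 1, 12, 0), windows)
```

Slots that fall between two windows are kept on purpose, since they can still match a window once a later collection completes.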

gerritholl commented 1 year ago

I didn't know there existed standardised message types with defined data structures. Is this defined/documented and/or enforced/tested anywhere?

Should there be a new message type like library (file -> dataset -> collection -> library 😜) or something that has a list named library with collections with datasets inside?

For what it's worth, in one software package I know the seven dimensions are called Library, Vitrine, Shelf, Book, Page, Row, Column :-)

On a more serious note, if we do use standardised names and a collection collects all granules or segments belonging to a single scene, then "multicollection" would be I think quite clear in its purpose.

pnuu commented 1 year ago

I didn't know there existed standardised message types with defined data structures. Is this defined/documented and/or enforced/tested anywhere?

I doubt it's documented anywhere. I was thinking the same earlier today. But the above is most of what we have in use in posttroll-based packages. The file message type is the most common. Segment gatherer uses dataset if it receives files, and collection if it receives datasets. Geographic collector always publishes collection messages. There are some other types, at least in Trollmoves (ack, push, error, pong, err, unknown show up in a quick grep), for internal communications.

On a more serious note, if we do use standardised names and a collection collects all granules or segments belonging to a single scene, then "multicollection" would be I think quite clear in its purpose.

I like that, the data are most likely passed to MultiScene in Satpy, so that'd match.

mraspaud commented 1 year ago

I'm thinking that the difference between collection and multicollection is not really obvious, while temporal_collection is more explicit...

pnuu commented 1 year ago

Thanks, I'll think about the naming. I've started with multicollection also for the internals, but changing that isn't too complicated.

pytroll / pytroll-collectors