pytroll / pytroll-collectors

Collector modules for Pytroll
GNU General Public License v3.0

Collect and publish data for multiple different time slots #142

Open pnuu opened 1 year ago

pnuu commented 1 year ago

This PR adds a way to collect metadata for multiple different configurable time slots and publish them in a single message.

Closes #140

codecov[bot] commented 1 year ago

Codecov Report

Merging #142 (ad7cbbe) into main (5f154a1) will decrease coverage by 0.68%. The diff coverage is 97.90%.

@@            Coverage Diff             @@
##             main     #142      +/-   ##
==========================================
- Coverage   91.64%   90.96%   -0.68%     
==========================================
  Files          27       29       +2     
  Lines        4115     4547     +432     
==========================================
+ Hits         3771     4136     +365     
- Misses        344      411      +67     
Flag        Coverage Δ
unittests   90.96% <97.90%> (-0.68%) ↓

Flags with carried forward coverage won't be shown.

Impacted Files                              Coverage Δ
pytroll_collectors/segments.py              93.14% <96.66%> (+0.64%) ↑
pytroll_collectors/tests/test_segments.py   100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 5087848949

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Totals Coverage Status
Change from base Build 5067098725: 0.0%
Covered Lines: 0
Relevant Lines: 0

pnuu commented 1 year ago

For reference, here are some message data structures I found in my production logs.

For the segment gatherer I only found this structure:

dataset = {
    "start_time": "2023-05-25T10:50:00",
    "platform_name": "Meteosat-11",
    "sensor": ["seviri"],
    "dataset": [
        {
            "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__",
            "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__"
        },
        ...
        {
            "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__",
            "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__"
        }
    ],
}

The message type is dataset. The same structure also appears when collecting, e.g., VIIRS channel segments.
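As a minimal sketch of consuming such a payload (the helper name and the file paths are made up for illustration, not part of pytroll-collectors), the URIs of a dataset message could be pulled out like this:

```python
# Hypothetical helper: collect the file URIs carried in the "dataset"
# list of a dataset-type message payload like the one shown above.
def uris_from_dataset(msg_data):
    """Return the list of file URIs from a dataset message payload."""
    return [item["uri"] for item in msg_data["dataset"]]


# Shortened example payload with made-up paths.
dataset = {
    "start_time": "2023-05-25T10:50:00",
    "platform_name": "Meteosat-11",
    "sensor": ["seviri"],
    "dataset": [
        {"uri": "/tmp/PRO", "uid": "PRO"},
        {"uri": "/tmp/EPI", "uid": "EPI"},
    ],
}

print(uris_from_dataset(dataset))  # -> ['/tmp/PRO', '/tmp/EPI']
```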

For the simple case of single-segment AVHRR data, the geographic gatherer returns collection messages such as:

collection = {
    "sensor": "avhrr",
    "platform_name": "Metop-C",
    "start_time": "2023-05-25T06:24:00",
    "end_time": "2023-05-25T06:33:00",
    "collection": [
        {
            "start_time": "2023-05-25T06:24:00",
            "end_time": "2023-05-25T06:25:00",
            "uri": "/data/oper/avhrr/ears/level0/AVHR_HRP_00_M03_20230525062400Z_20230525062500Z_N_O_20230525062820Z",
            "uid": "AVHR_HRP_00_M03_20230525062400Z_20230525062500Z_N_O_20230525062820Z"
        },
        ...
        {
            "start_time": "2023-05-25T06:32:00",
            "end_time": "2023-05-25T06:33:00",
            "uri": "/data/oper/avhrr/ears/level0/AVHR_HRP_00_M03_20230525063200Z_20230525063300Z_N_O_20230525063403Z",
            "uid": "AVHR_HRP_00_M03_20230525063200Z_20230525063300Z_N_O_20230525063403Z"
        }
    ]
}
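A quick sanity check on such a payload (the helper is hypothetical, assuming the message data is the plain dict shown above with ISO-8601 time strings) could verify that all granules fall inside the top-level time span:

```python
from datetime import datetime


# Hypothetical helper: check that every granule in a "collection" message
# payload lies inside the top-level start_time/end_time span.
def collection_span_is_consistent(msg_data):
    start = datetime.fromisoformat(msg_data["start_time"])
    end = datetime.fromisoformat(msg_data["end_time"])
    for granule in msg_data["collection"]:
        g_start = datetime.fromisoformat(granule["start_time"])
        g_end = datetime.fromisoformat(granule["end_time"])
        if g_start < start or g_end > end:
            return False
    return True


# Shortened example payload, keeping only the time keys used above.
collection = {
    "start_time": "2023-05-25T06:24:00",
    "end_time": "2023-05-25T06:33:00",
    "collection": [
        {"start_time": "2023-05-25T06:24:00", "end_time": "2023-05-25T06:25:00"},
        {"start_time": "2023-05-25T06:32:00", "end_time": "2023-05-25T06:33:00"},
    ],
}

print(collection_span_is_consistent(collection))  # -> True
```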

For compact VIIRS data, which has two channel segments for a single time, the collection consists of datasets:

collection_of_datasets = {
    "start_time": "2023-05-11T01:40:54.200000",
    "end_time": "2023-05-11T01:50:51.500000",
    "platform_name": "NOAA-20",
    "sensor": ["viirs"],
    "collection": [
        {
            "dataset": [
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVDNBC_j01_d20230511_t0140542_e0142187_b28372_c20230511015204000213_eum_ops.h5",
                    "uid": "SVDNBC_j01_d20230511_t0140542_e0142187_b28372_c20230511015204000213_eum_ops.h5"
                },
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVMC_j01_d20230511_t0140542_e0142187_b28372_c20230511015212000170_eum_ops.h5",
                    "uid": "SVMC_j01_d20230511_t0140542_e0142187_b28372_c20230511015212000170_eum_ops.h5"
                }
            ],
            "start_time": "2023-05-11T01:40:54.200000",
            "end_time": "2023-05-11T01:42:18.700000"
        },
        ...
        {
            "dataset": [
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVDNBC_j01_d20230511_t0149270_e0150515_b28372_c20230511015839000126_eum_ops.h5",
                    "uid": "SVDNBC_j01_d20230511_t0149270_e0150515_b28372_c20230511015839000126_eum_ops.h5"
                },
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVMC_j01_d20230511_t0149270_e0150515_b28372_c20230511015848000237_eum_ops.h5",
                    "uid": "SVMC_j01_d20230511_t0149270_e0150515_b28372_c20230511015848000237_eum_ops.h5"
                }
            ],
            "start_time": "2023-05-11T01:49:27",
            "end_time": "2023-05-11T01:50:51.500000"
        }
    ]
}
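Because the files are nested one level deeper here, a consumer has to walk both the collection and each inner dataset. A sketch of that (hypothetical helper, made-up file names):

```python
# Hypothetical helper: flatten a collection-of-datasets payload into one
# (uid, uri) tuple per file, walking the nested "collection" -> "dataset"
# structure shown above.
def flatten_collection_of_datasets(msg_data):
    files = []
    for entry in msg_data["collection"]:
        for item in entry["dataset"]:
            files.append((item["uid"], item["uri"]))
    return files


# Shortened example payload with made-up paths.
collection_of_datasets = {
    "collection": [
        {"dataset": [{"uid": "SVDNBC_1.h5", "uri": "/tmp/SVDNBC_1.h5"},
                     {"uid": "SVMC_1.h5", "uri": "/tmp/SVMC_1.h5"}]},
        {"dataset": [{"uid": "SVDNBC_2.h5", "uri": "/tmp/SVDNBC_2.h5"},
                     {"uid": "SVMC_2.h5", "uri": "/tmp/SVMC_2.h5"}]},
    ],
}

print(len(flatten_collection_of_datasets(collection_of_datasets)))  # -> 4
```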
pnuu commented 1 year ago

The multicollection message type could be something like this:

multicollection = {
    "start_times": ["2023-05-25T10:50:00", ... "2023-05-25T11:50:00"],
    "end_times": [],
    "platform_name": "Meteosat-11",
    "sensor": ["seviri"],
    "multicollection":
    [
        {
            "start_time": "2023-05-25T10:50:00",
            "platform_name": "Meteosat-11",
            "sensor": ["seviri"],
            "dataset": [
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__"
                },
                ...
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__"
                }
            ],
        },
        ...
        {
            "start_time": "2023-05-25T11:50:00",
            "platform_name": "Meteosat-11",
            "sensor": ["seviri"],
            "dataset": [
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251150-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251150-__"
                },
                ...
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251150-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251150-__"
                }
            ],
        }
    ]
}

The top-level start_times and end_times lists might help later with sorting or data selection. With my chosen path of reusing segment gatherer internals in the code, it is not possible to collect data from different streams. If the collection happened in a separate process listening to multiple segment or geographic gatherers, we could get multicollections like this:

multicollection_2 = {
    "start_times": ["2023-05-25T10:50:00", ..., "2023-05-06T21:52:10.300000"],
    "end_times": [None, ..., "2023-05-06T21:53:34.800000"],
    "platform_names": ["Meteosat-11", ..., "NOAA-20"],
    "sensors": ["seviri", ..., "viirs"],
    "multicollection":
    [
        {
            "start_time": "2023-05-25T10:50:00",
            "platform_name": "Meteosat-11",
            "sensor": ["seviri"],
            "dataset": [
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__"
                },
                ...
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__"
                }
            ],
        },
        ...
        {
            "start_time": "2023-05-06T21:52:10.300000",
            "end_time": "2023-05-06T21:53:34.800000",
            "platform_name": "NOAA-20",
            "sensor": ["viirs"]
            "dataset": [
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVDNBC_j01_d20230506_t2152103_e2153348_b28312_c20230506220612000459_eum_ops.h5",
                    "uid": "SVDNBC_j01_d20230506_t2152103_e2153348_b28312_c20230506220612000459_eum_ops.h5"
                },
                ...
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVMC_j01_d20230506_t2152103_e2153348_b28312_c20230506220623000658_eum_ops.h5",
                    "uid": "SVMC_j01_d20230506_t2152103_e2153348_b28312_c20230506220623000658_eum_ops.h5"
                }
            ],
        }
    ]
}
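The proposed top-level lists could be derived mechanically from the per-collection entries. A sketch (the helper is hypothetical; end_time can be missing for geostationary entries, hence the None placeholder in end_times):

```python
# Hypothetical helper: build the proposed top-level multicollection header
# lists from the individual collection entries.  Entries without an
# end_time (e.g. geostationary data) yield None.
def multicollection_header(entries):
    return {
        "start_times": [e["start_time"] for e in entries],
        "end_times": [e.get("end_time") for e in entries],
        "platform_names": [e["platform_name"] for e in entries],
        "sensors": [e["sensor"] for e in entries],
    }


# Shortened example entries mirroring the structure above.
entries = [
    {"start_time": "2023-05-25T10:50:00",
     "platform_name": "Meteosat-11", "sensor": ["seviri"]},
    {"start_time": "2023-05-06T21:52:10.300000",
     "end_time": "2023-05-06T21:53:34.800000",
     "platform_name": "NOAA-20", "sensor": ["viirs"]},
]

header = multicollection_header(entries)
print(header["end_times"])  # -> [None, '2023-05-06T21:53:34.800000']
```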

This structure could be used, for example, to collect geo ring data that could then be processed with Satpy MultiScene in one go. Now that I think of it, this would need completely different logic compared to the initial purpose of this PR (publishing multiple scenes with the same time, for example), so I'll go with the former.

pnuu commented 1 year ago

As @mraspaud said in https://github.com/pytroll/pytroll-collectors/issues/140#issuecomment-1560735188, I'll change this collection type to temporal_collection and start building the metadata collection.