pytroll / pytroll-collectors

Collector modules for Pytroll
GNU General Public License v3.0

Collect and publish data for multiple different time slots #142

Open pnuu opened 1 year ago

pnuu commented 1 year ago

This PR adds a way to collect metadata for multiple different configurable time slots and publish them in a single message.

Closes #140

codecov[bot] commented 1 year ago

Codecov Report

Merging #142 (ad7cbbe) into main (5f154a1) will decrease coverage by 0.68%. The diff coverage is 97.90%.

@@            Coverage Diff             @@
##             main     #142      +/-   ##
==========================================
- Coverage   91.64%   90.96%   -0.68%     
==========================================
  Files          27       29       +2     
  Lines        4115     4547     +432     
==========================================
+ Hits         3771     4136     +365     
- Misses        344      411      +67     
Flag        Coverage Δ
unittests   90.96% <97.90%> (-0.68%) ↓

Flags with carried forward coverage won't be shown.

Impacted Files                              Coverage Δ
pytroll_collectors/segments.py              93.14% <96.66%> (+0.64%) ↑
pytroll_collectors/tests/test_segments.py   100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 5087848949

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Totals Coverage Status
Change from base Build 5067098725: 0.0%
Covered Lines: 0
Relevant Lines: 0

pnuu commented 1 year ago

For reference, here are some message data structures I found in my production logs.

For the segment gatherer I only found this structure:

dataset = {
    "start_time": "2023-05-25T10:50:00",
    "platform_name": "Meteosat-11",
    "sensor": ["seviri"],
    "dataset": [
        {
            "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__",
            "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__"
        },
        ...
        {
            "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__",
            "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__"
        }
    ],
}

The message type is dataset. The same structure also appears when collecting, e.g., VIIRS channel segments.
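As a minimal sketch of consuming such a payload (the helper name and the file paths are made up for illustration, not part of pytroll-collectors), the URIs of a dataset message could be pulled out like this:

```python
# Hypothetical helper: collect the file URIs carried in the "dataset"
# list of a dataset-type message payload like the one shown above.
def uris_from_dataset(msg_data):
    """Return the list of file URIs from a dataset message payload."""
    return [item["uri"] for item in msg_data["dataset"]]


# Shortened example payload with made-up paths.
dataset = {
    "start_time": "2023-05-25T10:50:00",
    "platform_name": "Meteosat-11",
    "sensor": ["seviri"],
    "dataset": [
        {"uri": "/tmp/PRO", "uid": "PRO"},
        {"uri": "/tmp/EPI", "uid": "EPI"},
    ],
}

print(uris_from_dataset(dataset))  # -> ['/tmp/PRO', '/tmp/EPI']
```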

For the simple case of single-segment AVHRR data, the geographic gatherer returns collection messages such as:

collection = {
    "sensor": "avhrr",
    "platform_name": "Metop-C",
    "start_time": "2023-05-25T06:24:00",
    "end_time": "2023-05-25T06:33:00",
    "collection": [
        {
            "start_time": "2023-05-25T06:24:00",
            "end_time": "2023-05-25T06:25:00",
            "uri": "/data/oper/avhrr/ears/level0/AVHR_HRP_00_M03_20230525062400Z_20230525062500Z_N_O_20230525062820Z",
            "uid": "AVHR_HRP_00_M03_20230525062400Z_20230525062500Z_N_O_20230525062820Z"
        },
        ...
        {
            "start_time": "2023-05-25T06:32:00",
            "end_time": "2023-05-25T06:33:00",
            "uri": "/data/oper/avhrr/ears/level0/AVHR_HRP_00_M03_20230525063200Z_20230525063300Z_N_O_20230525063403Z",
            "uid": "AVHR_HRP_00_M03_20230525063200Z_20230525063300Z_N_O_20230525063403Z"
        }
    ]
}
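A quick sanity check on such a payload (the helper is hypothetical, assuming the message data is the plain dict shown above with ISO-8601 time strings) could verify that all granules fall inside the top-level time span:

```python
from datetime import datetime


# Hypothetical helper: check that every granule in a "collection" message
# payload lies inside the top-level start_time/end_time span.
def collection_span_is_consistent(msg_data):
    start = datetime.fromisoformat(msg_data["start_time"])
    end = datetime.fromisoformat(msg_data["end_time"])
    for granule in msg_data["collection"]:
        g_start = datetime.fromisoformat(granule["start_time"])
        g_end = datetime.fromisoformat(granule["end_time"])
        if g_start < start or g_end > end:
            return False
    return True


# Shortened example payload, keeping only the time keys used above.
collection = {
    "start_time": "2023-05-25T06:24:00",
    "end_time": "2023-05-25T06:33:00",
    "collection": [
        {"start_time": "2023-05-25T06:24:00", "end_time": "2023-05-25T06:25:00"},
        {"start_time": "2023-05-25T06:32:00", "end_time": "2023-05-25T06:33:00"},
    ],
}

print(collection_span_is_consistent(collection))  # -> True
```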

For compact VIIRS data, which has two channel segments for a single time, the collection consists of datasets:

collection_of_datasets = {
    "start_time": "2023-05-11T01:40:54.200000",
    "end_time": "2023-05-11T01:50:51.500000",
    "platform_name": "NOAA-20",
    "sensor": ["viirs"],
    "collection": [
        {
            "dataset": [
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVDNBC_j01_d20230511_t0140542_e0142187_b28372_c20230511015204000213_eum_ops.h5",
                    "uid": "SVDNBC_j01_d20230511_t0140542_e0142187_b28372_c20230511015204000213_eum_ops.h5"
                },
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVMC_j01_d20230511_t0140542_e0142187_b28372_c20230511015212000170_eum_ops.h5",
                    "uid": "SVMC_j01_d20230511_t0140542_e0142187_b28372_c20230511015212000170_eum_ops.h5"
                }
            ],
            "start_time": "2023-05-11T01:40:54.200000",
            "end_time": "2023-05-11T01:42:18.700000"
        },
        ...
        {
            "dataset": [
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVDNBC_j01_d20230511_t0149270_e0150515_b28372_c20230511015839000126_eum_ops.h5",
                    "uid": "SVDNBC_j01_d20230511_t0149270_e0150515_b28372_c20230511015839000126_eum_ops.h5"
                },
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVMC_j01_d20230511_t0149270_e0150515_b28372_c20230511015848000237_eum_ops.h5",
                    "uid": "SVMC_j01_d20230511_t0149270_e0150515_b28372_c20230511015848000237_eum_ops.h5"
                }
            ],
            "start_time": "2023-05-11T01:49:27",
            "end_time": "2023-05-11T01:50:51.500000"
        }
    ]
}
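Because the files are nested one level deeper here, a consumer has to walk both the collection and each inner dataset. A sketch of that (hypothetical helper, made-up file names):

```python
# Hypothetical helper: flatten a collection-of-datasets payload into one
# (uid, uri) tuple per file, walking the nested "collection" -> "dataset"
# structure shown above.
def flatten_collection_of_datasets(msg_data):
    files = []
    for entry in msg_data["collection"]:
        for item in entry["dataset"]:
            files.append((item["uid"], item["uri"]))
    return files


# Shortened example payload with made-up paths.
collection_of_datasets = {
    "collection": [
        {"dataset": [{"uid": "SVDNBC_1.h5", "uri": "/tmp/SVDNBC_1.h5"},
                     {"uid": "SVMC_1.h5", "uri": "/tmp/SVMC_1.h5"}]},
        {"dataset": [{"uid": "SVDNBC_2.h5", "uri": "/tmp/SVDNBC_2.h5"},
                     {"uid": "SVMC_2.h5", "uri": "/tmp/SVMC_2.h5"}]},
    ],
}

print(len(flatten_collection_of_datasets(collection_of_datasets)))  # -> 4
```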
pnuu commented 1 year ago

The multicollection message type could be something like this:

multicollection = {
    "start_times": ["2023-05-25T10:50:00", ... "2023-05-25T11:50:00"],
    "end_times": [],
    "platform_name": "Meteosat-11",
    "sensor": ["seviri"],
    "multicollection":
    [
        {
            "start_time": "2023-05-25T10:50:00",
            "platform_name": "Meteosat-11",
            "sensor": ["seviri"],
            "dataset": [
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__"
                },
                ...
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__"
                }
            ],
        },
        ...
        {
            "start_time": "2023-05-25T11:50:00",
            "platform_name": "Meteosat-11",
            "sensor": ["seviri"],
            "dataset": [
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251150-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251150-__"
                },
                ...
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251150-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251150-__"
                }
            ],
        }
    ]
}

The top-level start_times and end_times lists might help later with sorting or data selection. With my chosen path of reusing segment gatherer internals in the code, it is not possible to collect data from different streams. If the collection happened in a separate process listening to multiple segment or geographic gatherers, we could get multicollections like this:

multicollection_2 = {
    "start_times": ["2023-05-25T10:50:00", ..., "2023-05-06T21:52:10.300000"],
    "end_times": [None, ..., "2023-05-06T21:53:34.800000"],
    "platform_names": ["Meteosat-11", ..., "NOAA-20"],
    "sensors": ["seviri", ..., "viirs"],
    "multicollection":
    [
        {
            "start_time": "2023-05-25T10:50:00",
            "platform_name": "Meteosat-11",
            "sensor": ["seviri"],
            "dataset": [
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-PRO______-202305251050-__"
                },
                ...
                {
                    "uri": "/data/oper/seviri/rss/level1.5/H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__",
                    "uid": "H-000-MSG4__-MSG4_RSS____-_________-EPI______-202305251050-__"
                }
            ],
        },
        ...
        {
            "start_time": "2023-05-06T21:52:10.300000",
            "end_time": "2023-05-06T21:53:34.800000",
            "platform_name": "NOAA-20",
            "sensor": ["viirs"]
            "dataset": [
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVDNBC_j01_d20230506_t2152103_e2153348_b28312_c20230506220612000459_eum_ops.h5",
                    "uid": "SVDNBC_j01_d20230506_t2152103_e2153348_b28312_c20230506220612000459_eum_ops.h5"
                },
                ...
                {
                    "uri": "/data/oper/viirs/ears/level1b/SVMC_j01_d20230506_t2152103_e2153348_b28312_c20230506220623000658_eum_ops.h5",
                    "uid": "SVMC_j01_d20230506_t2152103_e2153348_b28312_c20230506220623000658_eum_ops.h5"
                }
            ],
        }
    ]
}
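The proposed top-level lists could be derived mechanically from the per-collection entries. A sketch (the helper is hypothetical; end_time can be missing for geostationary entries, hence the None placeholder in end_times):

```python
# Hypothetical helper: build the proposed top-level multicollection header
# lists from the individual collection entries.  Entries without an
# end_time (e.g. geostationary data) yield None.
def multicollection_header(entries):
    return {
        "start_times": [e["start_time"] for e in entries],
        "end_times": [e.get("end_time") for e in entries],
        "platform_names": [e["platform_name"] for e in entries],
        "sensors": [e["sensor"] for e in entries],
    }


# Shortened example entries mirroring the structure above.
entries = [
    {"start_time": "2023-05-25T10:50:00",
     "platform_name": "Meteosat-11", "sensor": ["seviri"]},
    {"start_time": "2023-05-06T21:52:10.300000",
     "end_time": "2023-05-06T21:53:34.800000",
     "platform_name": "NOAA-20", "sensor": ["viirs"]},
]

header = multicollection_header(entries)
print(header["end_times"])  # -> [None, '2023-05-06T21:53:34.800000']
```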

This structure could be used, for example, to collect geo ring data that could then be processed with Satpy MultiScene in one go. Now that I think of it, this would need completely different logic compared to the initial purpose of this PR (publishing multiple scenes with the same time, for example), so I'll go with the former.

pnuu commented 1 year ago

As @mraspaud said in https://github.com/pytroll/pytroll-collectors/issues/140#issuecomment-1560735188, I'll change this collection type to temporal_collection and start building the metadata collection.