use dmci to ingest frost-oda data in catalog

jo-asplin-met-no commented 1 year ago

Current plan: keep two netCDF files per time series (= combination of station and parameter): one for original observations and one for the latest version of each observation. Updates to the latter file will be registered for traceability.

mortenwh commented 1 year ago

A few more details, to be sure we have the same understanding. Maybe we can also have some flexibility in the dataset definitions. For example, I think it would be natural to have both wind speed and direction in the same file/dataset if the measurements are from the same instrument.

Dataset 1:

This is an open-ended netcdf-cf file with original observations. It should be fully documented with ACDD and CF metadata as described in the data management handbook. The id is of type uuid.
Create an MMD file and push it to dmci.
New observations are appended to the netcdf variable(s) in batches, with the same batch duration as will be agreed in E-SOH. I suggest 10 minutes as a start.
The batches of new observations should also be pushed on NATS and MQTT through MMS as a (coverage?)json payload with CF and ACDD metadata. These should be considered separate datasets with their own unique ids, linked to dataset 1 by the related_dataset metadata attribute.
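
A minimal sketch of the batching and linking described above, using only the Python standard library. The 10-minute window is the suggestion from this comment, not an agreed value, and the payload layout (field names other than related_dataset, and the function names) is hypothetical; it is not the E-SOH or MMS message format.

```python
import json
import uuid
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BATCH = timedelta(minutes=10)  # suggested starting point, not yet agreed
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def batch_start(t: datetime, batch: timedelta = BATCH) -> datetime:
    """Floor a timezone-aware timestamp to the start of its batch window."""
    return EPOCH + ((t - EPOCH) // batch) * batch


def group_into_batches(observations):
    """Group (time, value) pairs by batch window, in time order."""
    batches = defaultdict(list)
    for t, value in sorted(observations):
        batches[batch_start(t)].append((t, value))
    return dict(batches)


def batch_payload(parent_id: str, start: datetime, obs) -> str:
    """Build a JSON message for one batch (hypothetical layout)."""
    return json.dumps({
        "id": str(uuid.uuid4()),        # each batch is its own dataset
        "related_dataset": parent_id,   # link back to dataset 1
        "time_coverage_start": start.isoformat(),
        "time_coverage_end": (start + BATCH).isoformat(),
        "observations": [{"time": t.isoformat(), "value": v} for t, v in obs],
    })
```

Each payload gets its own uuid, so a consumer of the queue can always find the open-ended file the batch belongs to through related_dataset.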

Dataset 2a:

This is the first iteration of dataset 1, with some updated observations in a variable.
It is a new netcdf file with a new uuid, so a new MMD file must be created and pushed to dmci.
New observations are added to the netcdf file as above. There is no need to push them on the queue, since it is the same data (corrections are coming later, right?).
If some of the observations are updated:
- Copy the netcdf file to "dataset 2b" below
- Set time_coverage_end
- Update the MMD file
- Delete the netcdf file
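
The four steps above could be sketched roughly as follows. This is an illustration under assumptions: files are assumed to be named "<uuid>.nc", and `catalog` is a plain dict standing in for the MMD records held by dmci. Neither is the real dmci interface, and applying the actual corrections to the new file is left out.

```python
import shutil
import uuid
from datetime import datetime, timezone
from pathlib import Path


def rotate_dataset(current: Path, catalog: dict, last_obs_time: datetime) -> Path:
    """Close the current iteration and start the next one.

    `catalog` maps dataset id -> metadata dict (hypothetical placeholder
    for the MMD records, not the dmci API).
    """
    old_id = current.stem
    new_id = str(uuid.uuid4())
    successor = current.with_name(f"{new_id}.nc")

    # Step 1: copy the netcdf file to the next iteration ("dataset 2b").
    shutil.copy2(current, successor)
    catalog[new_id] = {"id": new_id}

    # Steps 2 and 3: set time_coverage_end and update the old metadata record.
    catalog[old_id]["time_coverage_end"] = last_obs_time.isoformat()

    # Step 4: delete the old data file; its metadata record is kept.
    current.unlink()
    return successor
```

Note that only the data file is removed. The old record stays in `catalog`, which mirrors the idea that the MMD file outlives the netcdf file.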

Dataset 2b:

This is the second iteration of dataset 2
It is another netcdf file with a new uuid, so a new MMD file must be created and pushed to dmci
New observations added as above
Same procedure to create dataset 2c and so on.

Can you review these steps, and make issues of them so we can track progress and prioritize?

jo-asplin-met-no commented 1 year ago

There's one thing I don't get: What's the use case of being able to see the "correction sequence" of a dataset when you can only access the first and last version of the dataset itself (i.e. the actual data)? I.e. a given dataset (time series of point observations in this case) might have been corrected (wrt one or more observations) N times, but you don't actually see the intermediate corrections, only the last one. Who would be interested in knowing the number N or the times at which those corrections were reported to DMCI? Or is only the latest correction (with corresponding MMD/UUID) kept by the DMCI (i.e. the persistent storage DMCI is writing to)? In that case I guess the "correction sequence" will always be of length 1 or 2 (i.e. no intermediate MMD/UUID info will be kept anywhere).

mortenwh commented 1 year ago

So by "correction sequence", I guess you refer to dataset 2a in my example. First of all, the FAIR principles say that data can be deleted but metadata must be kept. Why? To make the data traceable. If someone used dataset 2a for something, it should be possible to find information about that dataset and perhaps, in some cases, how to find replacements. All MMD files will be kept, so there will be a lot of MMD files in the end. However, as far as I have understood from those with more experience with it, that is not a problem.

Does this answer your question?

jo-asplin-met-no commented 1 year ago

Not really. I fail to see what metadata are changed in those cases, unless you count the mere fact that some updates to the dataset were detected at some point (i.e. at the time the update script happened to run). All other metadata describing the dataset stay the same, no?

mortenwh commented 1 year ago

At least the following two MMD fields will be updated for dataset 2a:

metadata_status set to inactive (I think - it is unclear in the MMD docs, ask @ferrighi)
time_coverage_end will be set to the time of last observation

We need to check if there is more. In the new dataset, the metadata_identifier will be new.
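
As an illustration of the two field updates, here is a small sketch with Python's xml.etree.ElementTree. The XML fragment is a simplified stand-in: its element names follow the field names used in this thread and have not been checked against the MMD schema, and the value "Inactive" is the assumption mentioned above.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an MMD record (hypothetical structure and id).
RECORD = """<record>
  <metadata_identifier>hypothetical-id-2a</metadata_identifier>
  <metadata_status>Active</metadata_status>
  <time_coverage_end></time_coverage_end>
</record>"""


def close_record(xml_text: str, last_obs_time: str) -> str:
    """Mark a record as superseded and set the end of its time coverage."""
    root = ET.fromstring(xml_text)
    root.find("metadata_status").text = "Inactive"  # assumed value, see MMD docs
    root.find("time_coverage_end").text = last_obs_time
    # metadata_identifier is left untouched: the old record keeps its id.
    return ET.tostring(root, encoding="unicode")
```

The new dataset would get its own record with a fresh metadata_identifier, while this record keeps the old one.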

Is it clearer now?

jo-asplin-met-no commented 1 year ago

No, because I still don't see why any S-ENDA end user would be interested in that history. Essentially this is just logging the operation of the Python script. There will be an entry in the history whenever the script reports corrections to the dataset. Maybe someone would be interested in seeing exactly at what times the script detected corrections, but that also depends on the execution frequency of the script itself. This type of info could maybe be of interest to maintainers of the script, but you could get that elsewhere, like in the crontab log or some other internal log.

jo-asplin-met-no commented 1 year ago

It would really help to identify a real and relevant use case for keeping all those MMD files rather than following the "ideal" FAIR principles blindly in this particular case.

mortenwh commented 1 year ago

What is the problem with making those MMD files? As I've pointed out many times, it is a requirement in FAIR, and FAIR is based on real use cases. Why not just follow it?

jo-asplin-met-no commented 1 year ago

Can you point me to a specific use case for this particular situation? If there is none, I suggest that we make an exception to FAIR in this case. If we can't identify a concrete and relevant use case, it doesn't make sense to me to keep this history. Increasing the code complexity for the purpose of keeping information that nobody requests simply seems wrong.

metno / S-ENDA-documentation

use dmci to ingest frost-oda data in catalog #267