metno / S-ENDA-documentation

Temporary documentation and use case descriptions for the S-ENDA project - when concepts are tested and verified, content is gradually moved to more long term solutions.
https://s-enda-documentation.readthedocs.io/

use dmci to ingest frost-oda data in catalog #267

Closed ElodieFZ closed 1 year ago

jo-asplin-met-no commented 1 year ago

Current plan: keep two netCDF files per time series (= combination of station and parameter): one for original observations and one for the latest version of each observation. Updates to the latter file will be registered for traceability.
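The plan above could be sketched roughly as follows. This is only an illustration, not the actual implementation: the class `TimeSeries`, its `ingest` method, and the in-memory dictionaries standing in for the two netCDF files are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimeSeries:
    """Hypothetical in-memory stand-in for the two netCDF files of one
    station/parameter combination."""
    station: str
    parameter: str
    # Observation time -> value as first reported (the "original" file).
    original: dict = field(default_factory=dict)
    # Observation time -> most recent value (the "latest version" file).
    latest: dict = field(default_factory=dict)
    # One entry per registered update: (detected_at, obs_time, old, new).
    update_log: list = field(default_factory=list)

    def ingest(self, obs_time, value, detected_at=None):
        """Store a value; return True if it corrected an existing observation."""
        detected_at = detected_at or datetime.now(timezone.utc)
        if obs_time not in self.original:
            # First time this observation is seen: both files get the value.
            self.original[obs_time] = value
            self.latest[obs_time] = value
            return False
        old = self.latest[obs_time]
        if old == value:
            # Nothing changed, so nothing is registered.
            return False
        # A correction: only the "latest" file changes, and the update is logged.
        self.latest[obs_time] = value
        self.update_log.append((detected_at, obs_time, old, value))
        return True
```

In this sketch the original observations are never overwritten, and each detected correction leaves one entry in `update_log`, which is the traceability the plan asks for.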

mortenwh commented 1 year ago

A few more details, to be sure we have the same understanding. Maybe we can also have some flexibility in the dataset definitions. For example, I think it would be natural to have both wind speed and direction in the same file/dataset if the measurements are from the same instrument.

Dataset 1:

Dataset 2a:

Dataset 2b:

Can you review these steps, and make issues of them so we can track progress and prioritize?

jo-asplin-met-no commented 1 year ago

There's one thing I don't get: What's the use case of being able to see the "correction sequence" of a dataset when you can only access the first and last version of the dataset itself (i.e. the actual data)? I.e. a given dataset (a time series of point observations in this case) might have been corrected (wrt one or more observations) N times, but you don't actually see the intermediate corrections, only the last one. Who would be interested in knowing the number N or the times at which those corrections were reported to DMCI? Or is only the latest correction (with corresponding MMD/UUID) kept by the DMCI (i.e. the persistent storage DMCI is writing to)? In that case I guess the "correction sequence" will always be of length 1 or 2 (i.e. no intermediate MMD/UUID info will be kept anywhere).

mortenwh commented 1 year ago

So by "correction sequence", I guess you are referring to dataset 2a in my example. First of all, the FAIR principles say that data can be deleted but metadata must be kept. Why? To make it possible to trace data. If someone used dataset 2a for something, it should be possible to find information about that dataset and perhaps, in some cases, how to find replacements. All MMD files will be kept, so there will be a lot of MMD files in the end. However, that is not a problem, as far as I have understood from those with more experience with it.

Does this answer your question?

jo-asplin-met-no commented 1 year ago

Not really. I fail to see what metadata are changed in those cases, unless you count the mere fact that some updates to the dataset were detected at some point (i.e. at the time the update script happened to run). All other metadata describing the dataset stay the same, no?

mortenwh commented 1 year ago

At least the following two MMD fields will be updated for dataset 2a:

We need to check if there is more. In the new dataset, the metadata_identifier will be new.

Is it clearer now?
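As a rough illustration of the point about identifiers, the sketch below keeps every metadata record and gives each new dataset version a fresh `metadata_identifier`. Only the name `metadata_identifier` comes from this thread; the function `register_correction`, the `replaces` key, and the use of plain dictionaries instead of real MMD files are assumptions made for this sketch.

```python
import uuid


def register_correction(history):
    """Append a record for a new dataset version and return it.

    `history` is a list of dicts standing in for the MMD files that are kept.
    Earlier records are never removed, only added to.
    """
    previous = history[-1]["metadata_identifier"] if history else None
    record = {
        # Each new dataset version gets a new identifier.
        "metadata_identifier": str(uuid.uuid4()),
        # Hypothetical back-reference so a user of an old version can find
        # what replaced it; not an actual MMD field name.
        "replaces": previous,
    }
    history.append(record)
    return record
```

With this structure, the length of `history` equals the number of versions that were registered, so the chain grows by one record per reported correction instead of staying at length 1 or 2.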

jo-asplin-met-no commented 1 year ago

No, because I still don't see why any S-ENDA end user would be interested in that history. Essentially this is just logging the operation of the Python script. There will be an entry in the history whenever the script reports corrections to the dataset. Maybe someone would be interested in seeing exactly when the script detected corrections, but that also depends on the execution frequency of the script itself. This type of info could maybe be of interest to maintainers of the script, but you could get that elsewhere, like in the crontab log or some other internal log.

jo-asplin-met-no commented 1 year ago

It would really help to identify a real and relevant use case for keeping all those MMD files rather than following the "ideal" FAIR principles blindly in this particular case.

mortenwh commented 1 year ago

What is the problem with making those MMD files? As I've pointed out many times, it is a requirement in FAIR, and FAIR is based on real use cases. Why not just follow it?

jo-asplin-met-no commented 1 year ago

Can you point me to a specific use case for this particular situation? If there is none, I suggest that we make an exception to FAIR in this case. If we can't identify a concrete and relevant use case, it doesn't make sense to me to keep this history. Increasing the code complexity to keep information that nobody requests simply seems wrong.