psychoinformatics-de / datalad-tabby

DataLad extension package for the "tabby" dataset metadata specification

Define `@id` for a `Dataset`(`Version`) #76

Closed mih closed 1 year ago

mih commented 1 year ago

Replaces: https://github.com/datalad/datalad-metalad/issues/380
Ping: https://github.com/datalad/datalad-registry/issues/217

A tabby record will typically be a version-level description (in HCLS terms). However, this is not necessarily the case: without a version label, we would be missing an essential component, and it would instantly be a summary-level description.

Such a difference would not necessarily impact the type annotation. Both could be dcat:Dataset or https://schema.org/Dataset. It would, however, matter for crafting a valid @id.
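The distinction above can be sketched as a small check. This is only an illustration: the `version` key and the helper name `description_level` are assumptions made for the example, not part of the tabby specification as quoted in this thread.

```python
def description_level(record: dict) -> str:
    """Classify a metadata record by HCLS description level.

    Assumption: the version label is stored under a 'version' key.
    A record without a non-empty version label is treated as summary-level.
    """
    version = record.get("version")
    if isinstance(version, str) and version.strip():
        return "version"
    return "summary"


# Both records carry the same type annotation; only the level differs.
with_version = {"@type": "dcat:Dataset", "name": "demo", "version": "1.0"}
without_version = {"@type": "dcat:Dataset", "name": "demo"}

levels = (description_level(with_version), description_level(without_version))
```

Since both records share the same `@type`, only a check like this (or an equivalent rule) would tell a consumer which kind of `@id` to expect.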

We need to have a common approach for @id choice within datalad's metadata ecosystem to simplify homogenization and merges across metadata sources (see https://github.com/datalad/datalad-metalad/issues/30 for other thoughts). Any approach to @id must not confuse the different description levels.

I posted some ideas in https://github.com/datalad/datalad-registry/issues/217#issuecomment-1641491693

Concrete issues:

christian-monch commented 1 year ago

One observation with regard to the HCLS Version Level Description (HCLS VLD), which I came across when working on the JSON-LD structure for the metalad_core-dataset extractor:

I am not sure that the HCLS VLD maps well onto commit-based versioning. The HCLS VLD concept seems to describe a selected state that was released. It seems more related to tags in datasets.

I just wanted to mention that; I am not proposing any change, because I think applying the HCLS VLD to commit-based versions can be justified (although the cardinality-1 pav:previousVersion does not always make sense there). It would also definitely be easier for us, because a dataset version-level ID could then just be something like:

https://dx.datalag.org/dataset/<uuid>@<commit-sha>

mih commented 1 year ago

Can you give a concrete example where it would not map well?

pav:previousVersion with gitshas as version identifiers would work nicely, IMHO. There is no way to alter the history without also (automatically) altering the identifiers, i.e. given any commit-version, the previous commit is always fixed/known. I do not understand when it "does not always make sense there".

https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in https://github.com/datalad/datalad-registry/issues/217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?
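To make the two candidates concrete, here is a minimal sketch that builds both forms. The base URL is copied from this thread and should be read as a placeholder; the function names and the requirement of a full 40-character SHA-1 hash are assumptions made for the example.

```python
import re
import uuid

# Base URL as written in this thread; treat it as a placeholder.
BASE = "https://dx.datalag.org/dataset"

# Assumption: version identifiers are full 40-character SHA-1 commit hashes.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def _check_sha(commit_sha: str) -> str:
    """Return the hash unchanged, or raise if it is not a full SHA-1."""
    if not _SHA_RE.fullmatch(commit_sha):
        raise ValueError(f"not a full SHA-1 commit hash: {commit_sha!r}")
    return commit_sha


def id_with_uuid(dataset_uuid: str, commit_sha: str) -> str:
    """Candidate 1: <base>/<uuid>@<commit-sha>."""
    # uuid.UUID() raises ValueError on malformed input and normalizes spelling.
    return f"{BASE}/{uuid.UUID(dataset_uuid)}@{_check_sha(commit_sha)}"


def id_sha_only(commit_sha: str) -> str:
    """Candidate 2: <base>/<commit-sha>."""
    return f"{BASE}/{_check_sha(commit_sha)}"


# Made-up example values, not taken from any real dataset.
example_uuid = "8de99b0e-5f7c-4a3e-9c2b-0123456789ab"
example_sha = "0123456789abcdef0123456789abcdef01234567"

long_form = id_with_uuid(example_uuid, example_sha)
short_form = id_sha_only(example_sha)
```

The second form needs one input fewer and yields a shorter identifier, which is the length difference mentioned above.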

christian-monch commented 1 year ago

> Can you give a concrete example where it would not map well?

I thought about merge-commits. But that might not be relevant, because we can choose one of the parents.
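A minimal sketch of "choose one of the parents", assuming the first parent is the one picked. The toy commit graph, the short made-up commit names, and the helper names are all hypothetical; the property is spelled `pav:previousVersion` as in the PAV ontology.

```python
from typing import Optional

# Base URL as written in this thread; treat it as a placeholder.
BASE = "https://dx.datalag.org/dataset"


def previous_version_id(commit_sha: str, parents: dict) -> Optional[str]:
    """Return the @id of the previous version, or None for a root commit.

    `parents` maps each commit to its list of parent commits in Git's order.
    Assumption: for a merge commit, the first parent is chosen.
    """
    commit_parents = parents.get(commit_sha, [])
    if not commit_parents:
        return None
    return f"{BASE}/{commit_parents[0]}"


def version_record(commit_sha: str, parents: dict) -> dict:
    """Build a small JSON-LD-style record with at most one previous version."""
    record = {"@id": f"{BASE}/{commit_sha}", "@type": "dcat:Dataset"}
    previous = previous_version_id(commit_sha, parents)
    if previous is not None:
        record["pav:previousVersion"] = {"@id": previous}
    return record


# Toy history: c3 merges c2 and b1, both of which descend from the root c1.
toy_parents = {"c3": ["c2", "b1"], "c2": ["c1"], "b1": ["c1"], "c1": []}

merge_record = version_record("c3", toy_parents)
root_record = version_record("c1", toy_parents)
```

Picking the first parent keeps the cardinality at one, at the cost of not recording the other branch of the merge in this property.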

> https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in datalad/datalad-registry#217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?

I would actually also propose the latter, i.e. https://dx.datalag.org/dataset/<commit-sha>. That is similar to what is used in the studyminimeta-extractor output (https://schema.datalad.org/datalad_dataset#<commit-sha>).

mih commented 1 year ago

re https://dx.datalag.org/dataset/<commit-sha> vs https://dx.datalag.org/dataset/<uuid>@<commit-sha>

The strongest argument that I can find for going for gitsha-only is:

Not having a UUID component in the version-level ID avoids this complication, with no loss of functionality or precision.

mih commented 1 year ago

This is very much in the context of describing datasets that are DataLad datasets. The main forum for that should be https://github.com/datalad/datalad-metalad/issues/389

101 also contains an example of an approach that does not require DataLad identifiers.