mih commented 1 year ago

Replaces: https://github.com/datalad/datalad-metalad/issues/380 Ping: https://github.com/datalad/datalad-registry/issues/217

A tabby record will typically be a version-level description (in HCLS terms). However, this is not necessarily the case: without a version label, we would be missing an essential component, and it would instantly be a summary-level description.

Such a difference would not necessarily impact the type annotation. Both could be dcat:Dataset or https://schema.org/Dataset. It would, however, matter for crafting a valid @id.

We need to have a common approach for @id choice within datalad's metadata ecosystem to simplify homogenization and merges across metadata sources (see https://github.com/datalad/datalad-metalad/issues/30 for other thoughts). Any approach to @id must not confuse the different description levels.

I posted some ideas in https://github.com/datalad/datalad-registry/issues/217#issuecomment-1641491693

Concrete issues:

- a metadata record may have one or more dataset DOIs on record. Any of these could serve as @id. However, in general we will not be able to infer the nature of such a DOI (concept DOI covering all versions vs. version-specific DOI). Moreover, such a DOI may be specific to a particular download (distribution-level identifier). One and the same dataset (version) could be hosted in more than one data portal and receive different DOIs that all point to the exact same information at different locations.
- a tabby metadata extractor would need to report at least two metadata records: the version-level description and a concept-level description (the former linking to the latter via isVersionOf).
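
To make the second point concrete, here is a minimal sketch of what such a pair of records could look like. Everything in it is an assumption for illustration: the `example.org` base URL, the `make_records` helper, the ID layout, and the choice of `dcterms:isVersionOf` and `pav:version` as the linking and version properties. It is not what any existing extractor emits.

```python
from typing import Dict, List

# Assumption: placeholder base URL; the thread has not settled on an @id scheme.
BASE = "https://example.org/dataset/"

# Namespace prefixes used by the two records below.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "pav": "http://purl.org/pav/",
}


def make_records(concept_key: str, version_label: str, title: str) -> List[Dict]:
    """Return [version-level record, concept-level record] (hypothetical helper)."""
    concept_id = f"{BASE}{concept_key}"
    version_id = f"{BASE}{concept_key}/{version_label}"
    concept = {
        "@context": CONTEXT,
        "@id": concept_id,
        "@type": "dcat:Dataset",
        "dcterms:title": title,
    }
    version = {
        "@context": CONTEXT,
        "@id": version_id,
        "@type": "dcat:Dataset",  # same type on both levels; only the @id differs
        "dcterms:title": title,
        "pav:version": version_label,
        "dcterms:isVersionOf": {"@id": concept_id},  # version links to concept
    }
    return [version, concept]


records = make_records("demo-dataset", "1.0", "Demo dataset")
```

Both records carry the same type annotation, so the two description levels are only kept apart by their distinct `@id` values and the link between them.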

christian-monch commented 1 year ago

One observation with regard to the HCLS Version Level Description (HCLS VLD), which I came across when working on the JSON-LD structure for the metalad_core-dataset extractor:

I am not sure that the HCLS VLD maps well onto commit-based versioning. The HCLS VLD concept seems to describe a selected state that was released. It seems more related to tags in datasets.

I just wanted to mention this; I am not proposing any change, because I think applying HCLS VLD to commit-based versions can be justified (although the cardinality-1 pav:previousVersions does not always make sense there). It would also definitely be easier for us, because a dataset version-level ID could then simply be something like:

https://dx.datalag.org/dataset/<uuid>@<commit-sha>

mih commented 1 year ago

Can you give a concrete example where it would not map well?

pav:previousVersions with gitshas as version identifiers would work nicely, IMHO. There is no way to alter the history without also altering the identifiers (automatically), i.e. given any commit-version, the previous commit is always fixed/known. I do not understand when it "does not always make sense there".

https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in https://github.com/datalad/datalad-registry/issues/217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?

christian-monch commented 1 year ago

> Can you give a concrete example where it would not map well?

I thought about merge-commits. But that might not be relevant, because we can choose one of the parents.
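
A rough sketch of the "choose one of the parents" idea, assuming we always pick the first parent of a merge commit. The `previous_version` helper and the toy history are hypothetical; in a real repository the parent lists could be read from something like `git rev-list --parents`.

```python
from typing import Dict, List, Optional


def previous_version(parents: Dict[str, List[str]], sha: str) -> Optional[str]:
    """Return a single predecessor for `sha`, or None for a root commit.

    Assumption: for merge commits the first parent is taken as the
    previous version, which keeps the link at cardinality one.
    """
    commit_parents = parents[sha]
    return commit_parents[0] if commit_parents else None


# Toy history with shortened, made-up identifiers:
# "root" <- "a" <- "merge", with a side branch "b" also merged into "merge".
history = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "merge": ["a", "b"],
}

prev_of_merge = previous_version(history, "merge")
prev_of_root = previous_version(history, "root")
```

With this rule the second parent ("b" above) is simply not recorded as a previous version, which is the information loss one would have to accept.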

> https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in datalad/datalad-registry#217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?

I would actually also propose the latter, i.e. https://dx.datalag.org/dataset/<commit-sha>. That is similar to what is used in the studyminimeta extractor output (https://schema.datalad.org/datalad_dataset#<commit-sha>).

mih commented 1 year ago

re https://dx.datalag.org/dataset/<commit-sha> vs https://dx.datalag.org/dataset/<uuid>@<commit-sha>

The strongest argument that I can find for going for gitsha-only is:

- any extractor executed on any real-world "dataset" will always have a gitsha to work with
- a DataLad UUID is standard in DataLad datasets, but not universally guaranteed
- if we make a "concept" UUID a requirement for a version-level ID, we either exclude any plain Git(Annex)Repo, or we require a standard mechanism to generate a UUID

Not having a UUID component in the version-level ID avoids this complication, with no loss of functionality or precision.
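
To illustrate the difference, a small sketch of both schemes side by side. The two helper functions are hypothetical, and the base URL is just the one written in this thread; nothing here is an agreed implementation.

```python
from typing import Optional

# Base URL as written in this thread; treat it as a placeholder.
BASE = "https://dx.datalag.org/dataset/"


def version_id_sha_only(commit_sha: str) -> str:
    """Version-level @id from the commit SHA alone (hypothetical helper)."""
    return f"{BASE}{commit_sha}"


def version_id_with_uuid(dataset_uuid: Optional[str], commit_sha: str) -> str:
    """Version-level @id that also needs a DataLad UUID (hypothetical helper)."""
    if dataset_uuid is None:
        # A plain Git or git-annex repository has no DataLad UUID to offer.
        raise ValueError("no DataLad UUID available for this repository")
    return f"{BASE}{dataset_uuid}@{commit_sha}"


# Made-up, shortened identifiers for illustration only.
sha_id = version_id_sha_only("abc123")
uuid_id = version_id_with_uuid("0000-demo-uuid", "abc123")
```

The SHA-only variant works for any repository with at least one commit, whereas the UUID variant needs either a failure path or a fallback UUID generator for plain Git repositories.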

mih commented 1 year ago

This is very much in the context of describing datasets that are DataLad datasets. The main forum for that should be https://github.com/datalad/datalad-metalad/issues/389

101 also contains an example of an approach that does not require DataLad identifiers.

psychoinformatics-de / datalad-tabby

Define `@id` for a `Dataset`(`Version`) #76
