Closed mih closed 1 year ago
One observation with regard to the HCLS Version Level Description (HCLS VLD), which I came across when working on the JSON-LD structure for the metalad_core-dataset extractor:
I am not sure that the HCLS VLD maps well onto commit-based versioning. The HCLS VLD concept seems to describe a selected state that was released. It seems more related to tags in datasets.
Just wanted to mention that. I am not proposing any change, because I think it can be justified to apply HCLS VLD to commit-based versions (although the cardinality-1 pav:previousVersions does not always make sense there). It would also definitely be easier for us, because a dataset version-level ID could then just be something like:
https://dx.datalag.org/dataset/<uuid>@<commit-sha>
Can you give a concrete example where it would not map well?
pav:previousVersions with gitshas as version identifiers would work nicely, IMHO. There is no way to alter the history without also altering the identifiers (automatically), i.e. given any commit-version, the previous commit is always fixed/known. I do not understand when it "does not always make sense there".
https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in https://github.com/datalad/datalad-registry/issues/217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?
> Can you give a concrete example where it would not map well?
I thought about merge-commits. But that might not be relevant, because we can choose one of the parents.
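A minimal sketch of the "choose one of the parents" idea, assuming the first parent is the one picked. The input is one line of `git rev-list --parents` output (a commit SHA followed by its parent SHAs). The property name and base URL are copied as written in this thread; the function names are hypothetical.

```python
BASE = "https://dx.datalag.org/dataset/"  # host as written in this thread
PREVIOUS = "pav:previousVersions"  # property name as written in this thread


def parse_rev_list_line(line: str) -> tuple[str, list[str]]:
    """Split one line of `git rev-list --parents` output into (commit, parents)."""
    commit, *parents = line.split()
    return commit, parents


def version_record(commit: str, parents: list[str]) -> dict:
    """Build a version-level record with at most one previous version.

    Assumption: for merge commits the first parent is chosen, which keeps
    the previous-version link at cardinality 1. A root commit has no link.
    """
    record = {"@id": f"{BASE}{commit}"}
    if parents:
        record[PREVIOUS] = {"@id": f"{BASE}{parents[0]}"}
    return record
```

With this choice, a merge commit still yields exactly one previous version, but the second parent's history is not reachable through the link, which may or may not matter for a given use case.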
> https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in datalad/datalad-registry#217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?
I would actually also propose the latter, i.e. https://dx.datalag.org/dataset/<commit-sha>. That is actually similar to what is used in the studyminimeta-extractor-output (https://schema.datalad.org/datalad_dataset#<commit-sha>).
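The two candidate forms discussed above could be sketched as follows. This is only an illustration: the helper names are hypothetical, the base URL is copied as written in the thread, and a full 40-character SHA-1 commit hash is assumed.

```python
import re
import uuid

BASE = "https://dx.datalag.org/dataset/"  # host as written in this thread
SHA1 = re.compile(r"[0-9a-f]{40}")  # assumption: full SHA-1 commit hashes


def sha_only_id(commit_sha: str) -> str:
    """Version-level ID of the form <base>/<commit-sha>."""
    if not SHA1.fullmatch(commit_sha):
        raise ValueError(f"not a full SHA-1 hash: {commit_sha!r}")
    return f"{BASE}{commit_sha}"


def uuid_sha_id(dataset_uuid: str, commit_sha: str) -> str:
    """Version-level ID of the form <base>/<uuid>@<commit-sha>."""
    canonical = str(uuid.UUID(dataset_uuid))  # raises ValueError if malformed
    return f"{BASE}{canonical}@{commit_sha}"


def commit_of(version_id: str) -> str:
    """Recover the commit SHA from either ID form."""
    tail = version_id[len(BASE):]
    return tail.rsplit("@", 1)[-1]
```

Both forms identify the same commit, so `commit_of` returns the same SHA for either; the UUID component only adds length.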
re https://dx.datalag.org/dataset/<commit-sha> vs https://dx.datalag.org/dataset/<uuid>@<commit-sha>
The strongest argument that I can find for going for gitsha-only is:
Not having a UUID component in the version-level ID avoids this complication, with no loss of functionality or precision.
This is very much in the context of describing datasets that are DataLad datasets. The main forum for that should be https://github.com/datalad/datalad-metalad/issues/389
Replaces: https://github.com/datalad/datalad-metalad/issues/380 Ping: https://github.com/datalad/datalad-registry/issues/217
A tabby record will typically be a version-level description (in HCLS terms). However, this is not necessarily the case (without a version label, we would be missing an essential component, and it would instantly be a summary-level description).

Such a difference would not necessarily impact the type annotation. Both could be dcat:Dataset or https://schema.org/Dataset. It would, however, matter for crafting a valid @id.

We need to have a common approach for @id choice within datalad's metadata ecosystem to simplify homogenization and merges across metadata sources (see https://github.com/datalad/datalad-metalad/issues/30 for other thoughts). Any approach to @id must not confuse the different description levels.

I posted some ideas in https://github.com/datalad/datalad-registry/issues/217#issuecomment-1641491693
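To illustrate the distinction drawn above, here is a rough sketch that classifies a record by description level and keeps the levels apart before any merge. It assumes the version label lives under a `version` key of a plain dict; that key and both function names are assumptions for illustration, not part of tabby or metalad.

```python
from collections import defaultdict


def description_level(record: dict) -> str:
    """Return 'version' if a non-empty version label is present, else 'summary'."""
    version = record.get("version")  # assumption: label stored under "version"
    return "version" if version not in (None, "") else "summary"


def group_for_merge(records: list[dict]) -> dict:
    """Group records by (description level, @id).

    Only records within the same group are candidates for merging, so a
    summary-level and a version-level record are never mixed, even if they
    happen to carry the same @id.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(description_level(record), record.get("@id"))].append(record)
    return dict(groups)
```

Grouping on the level as well as on `@id` is a defensive choice: it does not replace a sound `@id` scheme, it only prevents accidental merges while one is still being agreed on.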
Concrete issues:
- @id. However, in general we will not be able to infer the nature of such a DOI (concept DOI covering all versions vs. version-specific DOI). Moreover, such a DOI may be specific to a particular download (distribution-level identifier). One and the same dataset (version) could be hosted in more than one data portal and receive different DOIs that all point to the exact same information at different locations.
- tabby metadata extractor would need to report at least two metadata records: the version-level description, and a concept-level description (the former linking the latter via isVersionOf