marco-bolo / dataset-catalogue

The index for MBO datasets
Creative Commons Zero v1.0 Universal

Dataset identifiers #6

Open pieterprovoost opened 1 year ago

pieterprovoost commented 1 year ago

The spreadsheet template uses dataset identifiers to link together entries. Can we expect datasets to have DOIs at this stage, or do we want to add a DOI column so that a DOI can be added later without having to change identifiers throughout the sheet?

kmexter commented 11 months ago

The DOI would go in the Access sheet, as it is about how to access the data/metadata... but do you think it should be moved here? The thing is, not all datasets will have a DOI (or any PID).

pieterprovoost commented 11 months ago

DOI is commonly recommended as an option for a dataset PID (see GO FAIR, Google, Science on Schema) so I think it's perfectly fine to use as an identifier to link together the elements in the spreadsheet. If a DOI is not available then there are other PID options like a UUID, see https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/.

In any case, I think we should have an "identifiers" column which allows adding additional identifiers for a dataset (handle, ARK, DOI, UUID, etc.). This example from Google, for instance, has both a DOI and an ARK: https://developers.google.com/search/docs/appearance/structured-data/dataset
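As a rough sketch of how such an "identifiers" column could be exported, the snippet below keeps the sheet's internal identifier for linking rows and lists any external PIDs next to it in a schema.org-style `identifier` array. The function name `build_dataset_record`, the `internal_id` parameter, and all sample values are hypothetical and not taken from the MBO template.

```python
import json
import uuid


def build_dataset_record(name, internal_id=None, external_ids=()):
    """Return a schema.org-style Dataset dict with a list of identifiers.

    internal_id:  the sheet-level key used to link rows (hypothetical);
                  a UUID URN is minted when none is supplied.
    external_ids: any extra PIDs (DOI, ARK, handle, ...) as strings.
    """
    if internal_id is None:
        internal_id = f"urn:uuid:{uuid.uuid4()}"
    # Internal key first, then external PIDs; drop duplicates, keep order.
    identifiers = list(dict.fromkeys([internal_id, *external_ids]))
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "identifier": identifiers,
    }


record = build_dataset_record(
    "Example dataset",                                 # placeholder name
    internal_id="mbo-dataset-001",                     # hypothetical internal key
    external_ids=["https://doi.org/10.1234/example"],  # placeholder DOI
)
print(json.dumps(record, indent=2))
```

With this shape, a DOI obtained later is appended to `external_ids` and the internal key used elsewhere in the sheet does not change.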

kmexter commented 11 months ago

The identifiers column is purely for internal use, especially in the sheet "links to other data/publications", which allows you to identify which datasets came from which other ones. I don't want to change this. In the Access sheet there is now a single column, "DownloadURL", in which the DOI (or URL) of the dataset landing page should be added. 1. Is that sufficient? 2. Do we want more than one DownloadURL per dataset? It is a lot easier if we just ask for one.

kmexter commented 11 months ago

We need to decide whether or not to have the "distribution" in our JSON files, as currently we are only asking for a landing page, not a download page.

kmexter commented 3 months ago

We are now seeing that datasets come from diverse sources: some are from data repos/catalogues and have a DOI that points to a data landing page, some are URLs that point to a data selection tool (not the data itself), some are publications, and some are even GitHub repos. I think we are now at the point where we can look at what we are getting and think about how to include this info. Hopefully this is a discussion for our WP1 meeting on Sept 4.

ptagliolato commented 1 month ago

If we opt for distinguishing the different cases (PID landing page, direct download URL, a web page or web form to request data, or even a service to retrieve the data from), the latest DCAT version (DCAT 3) offers specific properties for this. See the examples at https://www.w3.org/TR/vocab-dcat-3/#example-landing-page
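To make the four cases concrete, here is a minimal sketch that maps each one onto a DCAT 3 property: `dcat:landingPage` on the dataset, and `dcat:downloadURL`, `dcat:accessURL`, or `dcat:accessService` (with `dcat:endpointURL`) on a distribution. The function `describe_access`, the `kind` labels, and the example.org URLs are assumptions made for illustration. They are not part of the current template or JSON files.

```python
import json

DCAT_CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
}


def describe_access(identifier, kind, url):
    """Build a JSON-LD dcat:Dataset using the DCAT property matching `kind`.

    kind is one of the hypothetical labels: "landing_page", "download",
    "access_page" (web page or form to request data), or "service".
    """
    dataset = {
        "@context": DCAT_CONTEXT,
        "@type": "dcat:Dataset",
        "dcterms:identifier": identifier,
    }
    if kind == "landing_page":
        # Landing pages are attached to the dataset itself.
        dataset["dcat:landingPage"] = {"@id": url}
        return dataset

    distribution = {"@type": "dcat:Distribution"}
    if kind == "download":
        distribution["dcat:downloadURL"] = {"@id": url}
    elif kind == "access_page":
        distribution["dcat:accessURL"] = {"@id": url}
    elif kind == "service":
        distribution["dcat:accessService"] = {
            "@type": "dcat:DataService",
            "dcat:endpointURL": {"@id": url},
        }
    else:
        raise ValueError(f"unknown access kind: {kind!r}")
    dataset["dcat:distribution"] = distribution
    return dataset


# Placeholder identifier and URL, for illustration only.
example = describe_access("mbo-dataset-001", "download", "https://example.org/data.csv")
print(json.dumps(example, indent=2))
```

Keeping the case label as a separate column in the sheet would let a single URL column be kept, with the export step choosing the property.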