Open kmexter opened 10 months ago
First email: Can we choose the terms in schema.org that we should use to describe data types in the dataset descriptions?
This list should include
From @pieterprovoost
I'm not entirely sure what the intention is here. Is this meant to go in Dataset.additionalType for example to categorize datasets, or rather in DataDownload.encodingFormat to indicate in what format the data are available? If it's the latter then we should use MIME types. If this is about categorizing datasets then I don't think we'll get there with schema.org classes, but I'm not sure what to suggest instead. Dublin Core has things like dcterms:StillImage, dcterms:Text, dcterms:MovingImage but maybe we are more interested in differentiating between sequence based data, imaging based data, acoustics based data, etc.
schema:Text is not applicable here as this is a DataType (as in boolean, number, date, etc).
From @marc-portier
there is indeed a range of different aspects in here, all of which could be useful at some point
1. a loose (human-like) depiction of what kind of file --> schema.org/CreativeWork has a number of subtypes that could fit too? (schema.org is probably closer to the ODIS approach, as well as to things like RO-Crate)
2. technical formats like mime-types, possibly also including character encodings --> https://schema.org/encodingFormat does seem to do the trick there
3. deeper content-conformity and schema-descriptions --> things we have been experimenting with inside Fair-Ease
My guess is that we should keep 2 and 3 above as recommended and optional, respectively, for the time being -- but make clear that as one grows in formally describing the distributions, one gets to unlock more and more useful side-effects.
But more importantly -- let us not try to mix these distinct aspects...
MARCO-BOLO-WP1 In the case of mime types (but I agree that it's not so easy-to-use for scientists involved in metadata creation) I would consider the IANA mime-types list (https://www.iana.org/assignments/media-types/media-types.xhtml#application)
Or is the intent instead to identify the definition of a dataset type such as “series” in general? In that case I could suggest the INSPIRE registry, but it is focused on spatial datasets. Series is defined by http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series
I am trying to find something more abstract; at the moment I have no better idea.
Since we need this Google Sheet released asap, Marc and I have chosen https://www.iana.org/assignments/media-types/media-types.xhtml as the place to choose the MimeType from, and that is what the column is now called. Please shout if you disagree.
@kmexter I wonder how useful this is if we are not collecting distribution URLs at the same time. How are we going to use this information? A MIME type is a property of a specific file, not of a dataset. Most datasets will include a variety of MIME types; Darwin Core archives, for example, are collections of text/csv and text/xml in application/zip.
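To illustrate the point that a MIME type belongs to a file rather than to a dataset, here is a minimal Python sketch that builds a toy archive in memory and lists the media type of each member. The file names (`occurrence.csv`, `meta.xml`, `eml.xml`) and the extension-to-type table are assumptions made for illustration, not taken from a real Darwin Core archive.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical lookup table for this sketch; a real pipeline would need a fuller mapping.
MEDIA_TYPES = {".csv": "text/csv", ".xml": "text/xml"}


def build_toy_archive() -> bytes:
    """Create a small zip in memory that mimics an archive of CSV and XML files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("occurrence.csv", "id,scientificName\n1,Mytilus edulis\n")
        archive.writestr("meta.xml", "<archive/>")
        archive.writestr("eml.xml", "<eml/>")
    return buffer.getvalue()


def member_media_types(zip_bytes: bytes) -> dict:
    """Map each member name to its media type (None if the extension is unknown)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: MEDIA_TYPES.get(PurePosixPath(name).suffix.lower())
            for name in archive.namelist()
        }


if __name__ == "__main__":
    for name, media_type in member_media_types(build_toy_archive()).items():
        print(f"{name}: {media_type}")
```

The container itself would be served as application/zip, while its members carry their own types, so a single "MimeType" value per dataset can only ever be a summary.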
Well, yes and no. It is useful to the person looking at the record ("ah, these are image data, yes I want image data"), but to ODIS it may not be useful information. It is a bit like the usefulness that keywords provide, in my mind. Yes, there could be several mimetypes, and that is OK, as indeed a single dataset can contain different types of data. We can choose to remove this - personally I think it is useful, but I don't object to being overridden.
Also...we could collect the distribution URLs - I mean, there is a field for it in the ODIS online example, so I am a bit uncertain why we are not asking for this from the MBO peeps also. It depends on the purpose of the ODIS record, I guess: for data already published in a catalogue, this record is a secondary one, but for data NOT already published, then this would be the primary record.....
I agree that the media-type is only meaningful when associated with a downloadURL of the distribution (and then it is also obvious there is only one).
I also agree that in many cases the mime-type has only limited value -- but better than nothing? The next level would be the schema conformity of the dataset (as suggested as one of the other aspects).
More comments welcome, everyone from MBO WP1! As I do need to know which to do, ideally by end of this week
I am leaning away from mime type now. For me, the point of this was to allow scientists to understand what is in the dataset before they bother to download it. So I would have this field as a literal -- because we cannot accommodate all the data types via schema. I would suggest using:
that, or get rid of this metadatum completely.
May I suggest the following, which also covers sequence data. If we can find a term for sequence data in some other ontology, it can go into additionalType.
{
  "@type": "Dataset",
  "hasPart": [
    {
      "@type": "ImageObject"
    },
    {
      "@type": "TextObject",
      "encodingFormat": "text/csv"
    },
    {
      "@type": "TextObject",
      "encodingFormat": "text/fasta"
    }
  ]
}
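As a rough sketch of how such a description could be generated rather than typed by hand, the Python below builds the same hasPart structure from a list of (type, format) pairs. The function name `describe_dataset` and its input shape are hypothetical, chosen only for this example.

```python
import json


def describe_dataset(parts):
    """Build a Dataset description whose hasPart entries carry a type and optional format.

    `parts` is a list of (schema_type, encoding_format) pairs; encoding_format may be None.
    """
    has_part = []
    for schema_type, encoding_format in parts:
        entry = {"@type": schema_type}
        if encoding_format is not None:
            entry["encodingFormat"] = encoding_format
        has_part.append(entry)
    return {"@type": "Dataset", "hasPart": has_part}


dataset = describe_dataset([
    ("ImageObject", None),
    ("TextObject", "text/csv"),
    ("TextObject", "text/fasta"),
])
print(json.dumps(dataset, indent=2))
```

Serialising with `json.dumps` also guards against syntax slips such as a trailing comma after the last list item, which strict JSON parsers reject.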
Alternatively, we can use distribution:
{
  "@type": "Dataset",
  "distribution": [
    {
      "@type": "DataDownload",
      "additionalType": "ImageObject"
    },
    {
      "@type": "DataDownload",
      "additionalType": "TextObject",
      "encodingFormat": "text/csv"
    },
    {
      "@type": "DataDownload",
      "additionalType": "TextObject",
      "encodingFormat": "text/fasta"
    }
  ]
}
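The two shapes carry the same information, so one can be derived from the other mechanically. A minimal sketch, assuming the input looks like the hasPart variant above; the helper name `parts_to_distribution` is made up for this example:

```python
import json


def parts_to_distribution(dataset):
    """Return a copy of `dataset` with hasPart entries re-expressed as DataDownload items."""
    distribution = []
    for part in dataset.get("hasPart", []):
        # The part's own type moves into additionalType of the DataDownload.
        download = {"@type": "DataDownload", "additionalType": part["@type"]}
        if "encodingFormat" in part:
            download["encodingFormat"] = part["encodingFormat"]
        distribution.append(download)
    converted = {key: value for key, value in dataset.items() if key != "hasPart"}
    converted["distribution"] = distribution
    return converted


example = {
    "@type": "Dataset",
    "hasPart": [
        {"@type": "ImageObject"},
        {"@type": "TextObject", "encodingFormat": "text/csv"},
    ],
}
print(json.dumps(parts_to_distribution(example), indent=2))
```

Which form to prefer is a modelling decision for the group; the sketch only shows that switching later would not lose information.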
In any case, don't use schema:Text.
Thumbs-up to that suggestion, Pieter.
minor glitch probably:
- schema:audioObject
Type-names are typically Uppercase --> https://schema.org/AudioObject
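A capitalisation slip like this is easy to catch automatically. The following is only a heuristic sketch (the function name is hypothetical): it flags names whose local part does not start with an uppercase letter, and does not check whether the type actually exists in schema.org.

```python
def miscapitalised_types(type_names):
    """Return the names whose local part does not start with an uppercase letter."""
    flagged = []
    for name in type_names:
        local = name.split(":", 1)[-1]  # drop an optional prefix such as "schema:"
        if not local[:1].isupper():
            flagged.append(name)
    return flagged


print(miscapitalised_types(["schema:audioObject", "schema:AudioObject", "ImageObject"]))
# prints ['schema:audioObject']
```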
When we have enough datasets from where we can harvest metadata (and perhaps ping those data), we can go further on this. For source data I think it is unlikely we will get this info, as it is not routinely held in metadata records (with some notable exceptions) but we should ask ourselves if we really want to push the WPs into providing this for the data that they create in MBO. We are already struggling to get info from them! TBD.
We think it is useful to add metadata describing the type of data that a dataset contains, but we are not sure exactly what we want to describe here.
In this issue we need to decide on this, and decide on the semantics to use.
We have to decide whether we want this metadatum to be useful as a piece of technical information (e.g. for OceanInfoHub) or for the audience (scientists, who are also those providing the descriptions in the first place). Personally, I think the second is better, mainly because the scientists describing the data will find that easier.
I copy below the discussion we have had so far in email