Describing data type: what exactly to describe and what controlled vocab(s) to use

kmexter commented 10 months ago

We think it is useful to add metadata describing the type of data that a dataset contains, but we are not sure exactly what we want to describe here:

the data format (i.e. the suffix)
a general data type (e.g. "spreadsheet")
something else

In this issue we need to decide on this, and decide on the semantics to use.

We have to decide whether we want this metadatum to be useful as a piece of technical information (e.g. for OceanInfoHub) or for the audience (scientists, who are also those providing the descriptions in the first place). Personally, I think the second is better, mainly because the scientists describing the data will find that easier.

Below I copy the discussion we have had so far by email.

kmexter commented 10 months ago

First email: Can we choose the terms in schema.org that we should use to describe data types in the dataset descriptions?

This list should include

text - I know there is schema:Text
spreadsheet: schema:SpreadsheetDigitalDocument? or application?
media (even images, video separately) schema:audio and schema:image and video?
not sure how omics files would be listed: probably text, since they are text of one sort or another
schema:Map (for indicators that are maps)?
Plots = I guess would be images?
any other types of data we know MBO cover?
DigitalDocument for documents?

kmexter commented 10 months ago

From @pieterprovoost

I'm not entirely sure what the intention is here. Is this meant to go in Dataset.additionalType for example to categorize datasets, or rather in DataDownload.encodingFormat to indicate in what format the data are available? If it's the latter then we should use MIME types. If this is about categorizing datasets then I don't think we'll get there with schema.org classes, but I'm not sure what to suggest instead. Dublin Core has things like dcterms:StillImage, dcterms:Text, dcterms:MovingImage but maybe we are more interested in differentiating between sequence based data, imaging based data, acoustics based data, etc.

schema:Text is not applicable here as this is a DataType (as in boolean, number, date, etc).
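
As a minimal sketch of the two placements mentioned above: Dataset.additionalType would categorize the dataset as a whole, while DataDownload.encodingFormat would carry the MIME type of one specific file. The name, the additionalType URL and the contentUrl are made-up placeholders, not agreed terms.

```json
{
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset (hypothetical)",
    "additionalType": "https://example.org/vocab/acoustics-based-data",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/data/recordings.zip",
            "encodingFormat": "application/zip"
        }
    ]
}
```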

kmexter commented 10 months ago

From @marc-portier

there is indeed a range of different aspects in here, all of which could be useful at some point

1. a loose (human-like) depiction of what kind of file --> schema.org/CreativeWork has a number of subtypes that could fit too? (schema.org is probably closer to the ODIS approach, as well as to things like RO-Crate)
2. technical formats like MIME types, possibly also including character encodings --> https://schema.org/encodingFormat does seem to do the trick there
3. deeper content conformity and schema descriptions --> things we have been experimenting with inside Fair-Ease

My guess is that we should keep 2 and 3 above as recommended and optional, respectively, for the time being -- but make clear that as one grows in formally describing the distributions, one gets to unlock more and more useful side-effects?

But more importantly -- let us not try to mix these distinct aspects...

kmexter commented 10 months ago

MARCO-BOLO-WP1 In the case of mime types (but I agree that it's not so easy-to-use for scientists involved in metadata creation) I would consider the IANA mime-types list (https://www.iana.org/assignments/media-types/media-types.xhtml#application)

Or is the intent instead to identify the definition of a dataset type “series” in general? In that case I could suggest the INSPIRE registry, but it is focused on spatial datasets. Series is defined by http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series

I will try to find something more abstract; at the moment I have no better idea.

kmexter commented 10 months ago

Since we need this Google Sheet released asap, Marc and I have chosen https://www.iana.org/assignments/media-types/media-types.xhtml as the place to choose the MimeType from, and that is what the column is now called. Please shout if you disagree.

pieterprovoost commented 9 months ago

@kmexter I wonder how useful this is if we are not collecting distribution URLs at the same time. How are we going to use this information? A MIME type is a property of a specific file, not of a dataset. Most datasets will include a variety of MIME types, Darwin Core archives for example are collections of text/csv and text/xml in application/zip.
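
To illustrate the point, here is a rough sketch of how a Darwin Core archive might be described, assuming hasPart is used to list the files inside the zip. The URL and file names are hypothetical placeholders.

```json
{
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Darwin Core archive (hypothetical)",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/archives/dwca-example.zip",
            "encodingFormat": "application/zip",
            "hasPart": [
                {
                    "@type": "MediaObject",
                    "name": "occurrence.csv",
                    "encodingFormat": "text/csv"
                },
                {
                    "@type": "MediaObject",
                    "name": "meta.xml",
                    "encodingFormat": "text/xml"
                }
            ]
        }
    ]
}
```

Here the MIME type that belongs to the download itself is application/zip; text/csv and text/xml only appear one level down, on the individual files.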

kmexter commented 9 months ago

Well, yes and no. It is useful to the person looking at the record ("ah, these are image data, yes I want image data"), but to ODIS it may not be useful information. It is a bit like the usefulness that keywords provide, in my mind. Yes, there could be several MIME types; that is OK, as indeed a single dataset can contain different types of data. We can choose to remove this - personally I think it is useful, but I don't object to being overridden.

kmexter commented 9 months ago

Also...we could collect the distribution URLs - I mean, there is a field for it in the ODIS online example, so I am a bit uncertain why we are not asking for this from the MBO peeps also. It depends on the purpose of the ODIS record, I guess: for data already published in a catalogue, this record is a secondary one, but for data NOT already published, then this would be the primary record.....

marc-portier commented 9 months ago

I agree that the media type is only meaningful when associated with a downloadURL of the distribution (and then it is also obvious there is only one).

I also agree that in many cases the MIME type has only limited value -- but better than nothing? The next level would be the schema conformity of the dataset (as suggested as one of the other aspects).

kmexter commented 9 months ago

More comments welcome, everyone from MBO WP1! As I do need to know which to do, ideally by end of this week

kmexter commented 9 months ago

I am leaning away from MIME type now. For me, the point of this was to allow scientists to understand what is in the dataset before they bother to download it. So I would have this field as a literal -- because we cannot accommodate all the data types via schema.org. I would suggest using:

schema:Text
schema:SpreadsheetDigitalDocument
schema:audioObject
schema:ImageObject
not sure how omics files would be listed: text is not suitable, we will have to diverge from schema.org here....
schema:Map

that, or get rid of this metadatum completely.

pieterprovoost commented 9 months ago

May I suggest the following, which also covers sequence data. If we can find a term for sequence data in some other ontology, it can go into additionalType.

{
    "@type": "Dataset",
    "hasPart": [
        {
            "@type": "ImageObject"
        },
        {
            "@type": "TextObject",
            "encodingFormat": "text/csv"
        },
        {
            "@type": "TextObject",
            "encodingFormat": "text/fasta"
        }
    ]
}

Alternatively, we can use distribution:

{
    "@type": "Dataset",
    "distribution": [
        {
            "@type": "DataDownload",
            "additionalType": "ImageObject"
        },
        {
            "@type": "DataDownload",
            "additionalType": "TextObject",
            "encodingFormat": "text/csv"
        },
        {
            "@type": "DataDownload",
            "additionalType": "TextObject",
            "encodingFormat": "text/fasta"
        }
    ]
}

In any case, don't use schema:Text.

kmexter commented 9 months ago

Thumbs-up to that suggestion, Pieter.

marc-portier commented 9 months ago

minor glitch probably:

schema:audioObject

Type names are typically capitalized --> https://schema.org/AudioObject

kmexter commented 1 month ago

When we have enough datasets from where we can harvest metadata (and perhaps ping those data), we can go further on this. For source data I think it is unlikely we will get this info, as it is not routinely held in metadata records (with some notable exceptions) but we should ask ourselves if we really want to push the WPs into providing this for the data that they create in MBO. We are already struggling to get info from them! TBD.

marco-bolo / dataset-catalogue

Describing data type: what exactly to describe and what controlled vocab(s) to use #14