mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

use XSD datatypes not schema.org datatypes #654

Open VladimirAlexiev opened 4 months ago

VladimirAlexiev commented 4 months ago

Schema.org datatypes are not good:

https://github.com/schemaorg/schemaorg/issues/1781 explains in more detail what's wrong with them.

This also leads to confusion, e.g., in https://github.com/mlcommons/croissant/blob/main/docs/croissant.ttl:

croissant:Format a rdf:Class ;
  rdfs:label "Format" ;
  rdfs:comment "Specifies how to parse the format of the data from a string representation. For example, format may hold a date format string, a number format, or a bounding box format." ;
  rdfs:subClassOf schema:Text .

croissant:format a rdf:Property ;
  rdfs:label "format" ;
  rdfs:comment "A format to parse the values of the data from text, e.g., a date format or number format." ;
  schema:domainIncludes croissant:DataSource ;
  schema:rangeIncludes croissant:Format .

This issue involves the ontologies and JSONLD context. Here's a count of occurrences in the two ontologies:

    2 schema:Boolean
    1 schema:DateTime
   29 schema:Text
    2 schema:URL

Also, I think it's better to distinguish between owl:DatatypeProperty and owl:ObjectProperty. Many Schema.org props are permissive and allow either a literal or an object ("string or thing"), but I think Croissant props are more precise.

VladimirAlexiev commented 4 months ago

This use of a schema datatype is perhaps the only valid one:

rai:dataCollectionTimeframe a rdf:Property ;
  rdfs:label "dataCollectionTimeframe" ;
  rdfs:comment "Timeframe in terms of start and end date of the collection process, that it described as a DateTime indicating a time period in <a href=\"https://en.wikipedia.org/wiki/ISO_8601#Time_intervals\">ISO 8601 time interval format</a>. For example, a collection time frame ranging from 2020 - 2022 can be indicated in ISO 8601 interval format via \"2020/2022\"." ;
  schema:domainIncludes schema:Dataset ;
  schema:rangeIncludes schema:DateTime .

The reason is that there's no XSD datatype to cover date intervals.
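To illustrate the gap, here is a minimal Python sketch (not part of Croissant): the interval string as a whole has no single XSD datatype, but each endpoint of a "start/end" interval can be matched to one. The function names and the simplified regular expressions are assumptions made for this example; real XSD lexical forms also allow negative years and timezones on dates, and ISO 8601 also allows duration-based intervals, which this sketch rejects.

```python
import re

# Simplified lexical patterns for a few XSD date/time datatypes.
# Assumption: four-digit years only, timezone accepted only on dateTime.
XSD_PATTERNS = [
    ("xsd:gYear", re.compile(r"\d{4}")),
    ("xsd:gYearMonth", re.compile(r"\d{4}-\d{2}")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    (
        "xsd:dateTime",
        re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?"),
    ),
]


def classify_endpoint(text):
    """Return the name of the first XSD datatype whose pattern matches text."""
    for name, pattern in XSD_PATTERNS:
        if pattern.fullmatch(text):
            return name
    raise ValueError(f"no XSD date/time datatype matches {text!r}")


def split_interval(value):
    """Split an ISO 8601 'start/end' interval and type each endpoint."""
    start, sep, end = value.partition("/")
    if sep != "/" or not start or not end:
        raise ValueError(f"not a start/end interval: {value!r}")
    return (start, classify_endpoint(start)), (end, classify_endpoint(end))
```

With the example from the rdfs:comment above, `split_interval("2020/2022")` types both endpoints as `xsd:gYear`, while a duration form such as `"2020-01-01/P1Y"` raises `ValueError`.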

benjelloun commented 4 months ago

Wow, the discussion in https://github.com/schemaorg/schemaorg/issues/1781 is quite amazing and instructive.

You make very valid points on the merits of xsd types vs. schema.org basic data types.

Personally, I would lean towards supporting both in Croissant, and specifying a clear mapping as you do in that discussion. If there is a consensus on the benefits, we can recommend using the xsd basic data types over the schema.org ones in the next version of Croissant.
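For illustration only, such a mapping could be as small as a lookup table. The sketch below covers just the four schema.org datatypes counted earlier in this issue; the specific XSD targets and the `to_xsd` helper are assumptions for this example, not the mapping from the linked schema.org discussion and not anything Croissant defines.

```python
# Hypothetical mapping from schema.org datatypes to XSD datatypes.
# Assumption: these are the natural XSD counterparts; Croissant may choose others.
SCHEMA_TO_XSD = {
    "schema:Boolean": "xsd:boolean",
    "schema:DateTime": "xsd:dateTime",
    "schema:Text": "xsd:string",
    "schema:URL": "xsd:anyURI",
}


def to_xsd(datatype):
    """Return the XSD counterpart of a schema.org datatype.

    Datatypes without an entry (for example semantic types from other
    vocabularies) are returned unchanged.
    """
    return SCHEMA_TO_XSD.get(datatype, datatype)
```

A tool that accepts both namespaces could normalize with `to_xsd` on input and keep a single code path for the basic types.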

In general, data typing in Croissant aims to be extensible, and not limited to a single namespace. For instance, users can "semantically" type their data by associating classes from schema.org, wikidata, or other vocabularies. That said, for basic data types we certainly want to favor consistency to reduce the burden on tools and users of the datasets.

As you noted, we do inherit some of the fuzziness of schema.org, but try to make things a bit more precise where necessary.

Regarding format, "holds" means "contains". cr:Format is just a marker type, but its values are still strings (err... I mean sc:Text. :-)

pierrot0 commented 4 months ago

We are definitely going to need to differentiate between int8, int16, uint8... and xsd has short, long, unsignedLong, etc. So in that regard xsd seems useful indeed.

Looking at numpy types, is xsd enough though? What mechanism do we want to support to describe a field as being an int128, or a complex number for example?
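To make the coverage question concrete, here is a rough sketch of how numpy dtype names line up with XSD datatypes of the same width. The table and the `xsd_for_dtype` helper are illustrative assumptions, not an agreed Croissant mapping; dtype names are plain strings so the example runs without numpy installed.

```python
from typing import Optional

# numpy dtype name -> XSD datatype with the same width and signedness.
NUMPY_TO_XSD = {
    "bool": "xsd:boolean",
    "int8": "xsd:byte",            # -128..127
    "int16": "xsd:short",
    "int32": "xsd:int",
    "int64": "xsd:long",
    "uint8": "xsd:unsignedByte",   # 0..255
    "uint16": "xsd:unsignedShort",
    "uint32": "xsd:unsignedInt",
    "uint64": "xsd:unsignedLong",
    "float32": "xsd:float",        # IEEE 754 single precision
    "float64": "xsd:double",       # IEEE 754 double precision
}


def xsd_for_dtype(dtype_name: str) -> Optional[str]:
    """Return the matching XSD datatype, or None when XSD has no counterpart."""
    return NUMPY_TO_XSD.get(dtype_name)


# Types from the question above that fall outside the table.
UNCOVERED = [name for name in ("int128", "float16", "complex64", "complex128")
             if xsd_for_dtype(name) is None]
```

Here `UNCOVERED` ends up containing all four names, which is the gap the question points at: the common fixed-width integers and floats have XSD counterparts, but int128, half-precision floats, and complex numbers do not.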

VladimirAlexiev commented 4 months ago

@pierrot0 For that you'd need custom datatypes.
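As a sketch of what a custom datatype amounts to on the tooling side: a datatype IRI plus a rule for turning the lexical form into a value. Everything below is hypothetical, including the `https://example.org/datatypes#` namespace, the two datatype names, and the choice of Python's `complex` syntax as the lexical form; none of it is defined by Croissant or XSD.

```python
# Hypothetical namespace for custom datatypes (an assumption for this example).
EX = "https://example.org/datatypes#"


def parse_int128(lexical):
    """Parse a decimal string and check it fits a signed 128-bit integer."""
    value = int(lexical)
    if not -(2 ** 127) <= value < 2 ** 127:
        raise ValueError(f"{lexical!r} is out of range for int128")
    return value


def parse_complex128(lexical):
    """Parse a complex number written in Python syntax, e.g. '1+2j'."""
    return complex(lexical)


# Registry: datatype IRI -> function mapping lexical form to value.
CUSTOM_DATATYPES = {
    EX + "int128": parse_int128,
    EX + "complex128": parse_complex128,
}


def parse_literal(lexical, datatype_iri):
    """Parse a typed literal using the registered custom datatype."""
    try:
        parser = CUSTOM_DATATYPES[datatype_iri]
    except KeyError:
        raise ValueError(f"unknown datatype: {datatype_iri}") from None
    return parser(lexical)
```

For example, `parse_literal("1+2j", EX + "complex128")` yields `complex(1, 2)`, and an int128 literal outside the 128-bit range raises `ValueError`. The cost of this route is that every consumer needs the same registry, which is the tooling burden mentioned earlier in the thread.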

How about large multidimensional arrays (tensors)? NetCDF and HDF5 for example have mechanisms for capturing such in binary and for describing them.