use XSD datatypes not schema.org datatypes

VladimirAlexiev commented 4 months ago

Schema.org datatypes are not good:

They go against the standard XSD datatypes that are the foundation of both XML and RDF.
They are tentative (they don't specify a lexical representation), e.g. schema:Number doesn't say what kind of number it is.
They are not implemented in semantic repositories, i.e. there are special indexes for xsd:date, xsd:decimal etc. but not for schema:Date, schema:Number etc.

https://github.com/schemaorg/schemaorg/issues/1781 explains in more detail what's wrong with them.

This also leads to confusion, eg in https://github.com/mlcommons/croissant/blob/main/docs/croissant.ttl:

croissant:Format a rdf:Class ;
  rdfs:label "Format" ;
  rdfs:comment "Specifies how to parse the format of the data from a string representation. For example, format may hold a date format string, a number format, or a bounding box format." ;
  rdfs:subClassOf schema:Text .

croissant:format a rdf:Property ;
  rdfs:label "format" ;
  rdfs:comment "A format to parse the values of the data from text, e.g., a date format or number format." ;
  schema:domainIncludes croissant:DataSource ;
  schema:rangeIncludes croissant:Format .

A datatype cannot be a subclass of another datatype.
You can define a custom datatype based on an XSD datatype, but that's done through restrictions (e.g. "Age is a subset of Integer, obtained by fixing minInclusive and maxInclusive"). As you don't define any restriction for Format, there's no need to define a new datatype.
In the class description, it's unclear what "hold" means: is the string stored directly in cr:format, or does cr:format point to a node of type cr:Format that holds the string?
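
As a rough sketch of the points above, here is what a restriction-based custom datatype looks like in OWL 2, next to a version of croissant:format that points at xsd:string directly. The ex: namespace, the 0 to 150 bounds, and the IRIs bound to croissant: and schema: are assumptions for illustration, not taken from croissant.ttl.

```turtle
@prefix croissant: <http://mlcommons.org/croissant/> .   # assumed IRI
@prefix ex:        <http://example.org/> .               # hypothetical namespace
@prefix owl:       <http://www.w3.org/2002/07/owl#> .
@prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema:    <https://schema.org/> .               # assumed IRI
@prefix xsd:       <http://www.w3.org/2001/XMLSchema#> .

# A custom datatype: xsd:integer restricted by two facets (bounds are illustrative).
ex:Age a rdfs:Datatype ;
  owl:equivalentClass [
    a rdfs:Datatype ;
    owl:onDatatype xsd:integer ;
    owl:withRestrictions ( [ xsd:minInclusive 0 ] [ xsd:maxInclusive 150 ] )
  ] .

# Format adds no restriction, so the property can point at xsd:string directly.
croissant:format a rdf:Property ;
  rdfs:label "format" ;
  schema:domainIncludes croissant:DataSource ;
  schema:rangeIncludes xsd:string .
```

With this shape the format string sits directly on cr:format as a literal, which would also settle the "hold" question.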

This issue involves the ontologies and JSONLD context. Here's a count of occurrences in the two ontologies:

    2 schema:Boolean                                                          
    1 schema:DateTime                                                         
   29 schema:Text                                                             
    2 schema:URL

Also, I think it's better to distinguish properties as owl:DatatypeProperty vs. owl:ObjectProperty. Many Schema.org props are permissive and allow either a literal or an object ("string or thing"), but I think Croissant props are more precise.
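
To make the distinction concrete, here is a minimal sketch with one literal-valued and one node-valued property. Every ex: name is hypothetical and is not an actual Croissant term.

```turtle
@prefix ex:   <http://example.org/> .          # hypothetical namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:DataSource a owl:Class .

# Literal-valued: the value is always a string literal.
ex:formatString a owl:DatatypeProperty ;
  rdfs:range xsd:string .

# Node-valued: the value is always a resource.
ex:dataSource a owl:ObjectProperty ;
  rdfs:range ex:DataSource .

# Instance data: one literal, one link to another node.
ex:field1 ex:formatString "%Y-%m-%d" ;
  ex:dataSource ex:source1 .

ex:source1 a ex:DataSource .
```

A "string or thing" property could not be declared this way, because a consumer would not know in advance whether to expect a literal or a node.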

VladimirAlexiev commented 4 months ago

This use of a schema datatype is perhaps the only valid one:

rai:dataCollectionTimeframe a rdf:Property ;
  rdfs:label "dataCollectionTimeframe" ;
  rdfs:comment "Timeframe in terms of start and end date of the collection process, that it described as a DateTime indicating a time period in <a href=\"https://en.wikipedia.org/wiki/ISO_8601#Time_intervals\">ISO 8601 time interval format</a>. For example, a collection time frame ranging from 2020 - 2022 can be indicated in ISO 8601 interval format via \"2020/2022\"." ;
  schema:domainIncludes schema:Dataset ;
  schema:rangeIncludes schema:DateTime .

The reason is that there's no XSD datatype to cover date intervals.

The https://schema.org/DateTime description doesn't allow an interval: "A combination of date and time of day in the form [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm] (see Chapter 5.4 of ISO 8601)."
But https://schema.org/datasetTimeInterval does allow an interval: "The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)". This prop has DateTime as its range (which confirms my claim that Schema datatypes are tentative).

benjelloun commented 4 months ago

Wow, the discussion in https://github.com/schemaorg/schemaorg/issues/1781 is quite amazing and instructive.

You make very valid points on the merits of xsd types vs. schema.org basic data types.

Personally, I would lean towards supporting both in Croissant, and specifying a clear mapping as you do in that discussion. If there is a consensus on the benefits, we can recommend using the xsd basic data types over the schema.org ones in the next version of Croissant.
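
For illustration only, such a mapping could be written down as data. The sketch below covers just the four schema.org datatypes counted earlier in this thread. It is not the mapping from the linked discussion, and ex:correspondsTo is a hypothetical property invented for this example.

```turtle
@prefix ex:     <http://example.org/> .        # hypothetical namespace
@prefix schema: <https://schema.org/> .        # assumed IRI
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# ex:correspondsTo is a made-up annotation property, deliberately weaker than
# owl:equivalentClass, since the schema.org types are more loosely defined.
schema:Boolean  ex:correspondsTo xsd:boolean .
schema:DateTime ex:correspondsTo xsd:dateTime .
schema:Text     ex:correspondsTo xsd:string .
schema:URL      ex:correspondsTo xsd:anyURI .
```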

In general, data typing in Croissant aims to be extensible, and not limited to a single namespace. For instance, users can "semantically" type their data by associating classes from schema.org, wikidata, or other vocabularies. That said, for basic data types we certainly want to favor consistency to reduce the burden on tools and users of the datasets.

As you noted, we do inherit some of the fuzziness of schema.org, but try to make things a bit more precise where necessary.

Regarding format, "holds" means "contains". cr:Format is just a marker type, but its values are still strings (err... I mean sc:Text. :-)

pierrot0 commented 4 months ago

We are definitely going to need to differentiate between int8, int16, uint8... and xsd has short, long, unsignedLong, etc. So in that regard xsd seems useful indeed.

Looking at numpy types, is xsd enough though? What mechanism do we want to support to describe a field as being an int128, or a complex number for example?

VladimirAlexiev commented 4 months ago

@pierrot0 For that you'd need custom datatypes.
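
A sketch of what that could look like for int128, using the same facet-restriction mechanism as for any other custom datatype. The ex: namespace and the name ex:int128 are hypothetical. Complex numbers are not covered here, since they are not a subset of any single XSD numeric type.

```turtle
@prefix ex:   <http://example.org/> .          # hypothetical namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Fixed-width integers that XSD already covers:
#   int8  -> xsd:byte     uint8  -> xsd:unsignedByte
#   int16 -> xsd:short    uint16 -> xsd:unsignedShort
#   int32 -> xsd:int      uint32 -> xsd:unsignedInt
#   int64 -> xsd:long     uint64 -> xsd:unsignedLong

# No XSD type is 128 bits wide, so int128 is sketched as a restriction of xsd:integer.
ex:int128 a rdfs:Datatype ;
  owl:equivalentClass [
    a rdfs:Datatype ;
    owl:onDatatype xsd:integer ;
    owl:withRestrictions (
      [ xsd:minInclusive -170141183460469231731687303715884105728 ]   # -2^127
      [ xsd:maxInclusive  170141183460469231731687303715884105727 ]   # 2^127 - 1
    )
  ] .
```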

How about large multidimensional arrays (tensors)? NetCDF and HDF5, for example, have mechanisms for storing such arrays in binary form and for describing them.

In RDF, you can use binary64, but will need further fields to describe the shape and elements of arrays.
Maybe this is relevant: https://linkml.io/linkml/howtos/multidimensional-arrays.html
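
Purely as an illustration of the "further fields" idea, an array description might carry an element type and a shape. All ex: terms below are hypothetical; they are not taken from LinkML, NetCDF, HDF5, or Croissant.

```turtle
@prefix ex:  <http://example.org/> .           # hypothetical namespace
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A 3 x 224 x 224 array of double-precision values.
ex:tensor1 a ex:NdArray ;
  ex:elementType xsd:double ;    # xsd:double is the IEEE 754 binary64 type
  ex:shape ( 3 224 224 ) .       # RDF list, one integer per dimension
```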

mlcommons / croissant #654