Open VladimirAlexiev opened 6 months ago
This use of a schema datatype is perhaps the only valid one:
rai:dataCollectionTimeframe a rdf:Property ;
rdfs:label "dataCollectionTimeframe" ;
rdfs:comment "Timeframe in terms of start and end date of the collection process, that it described as a DateTime indicating a time period in <a href=\"https://en.wikipedia.org/wiki/ISO_8601#Time_intervals\">ISO 8601 time interval format</a>. For example, a collection time frame ranging from 2020 - 2022 can be indicated in ISO 8601 interval format via \"2020/2022\"." ;
schema:domainIncludes schema:Dataset ;
schema:rangeIncludes schema:DateTime .
The reason is that there's no XSD datatype to cover date intervals.
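As a rough illustration of what a consumer has to do with such a value, here is a minimal Python sketch that splits an ISO 8601 `start/end` interval like "2020/2022" into its two endpoints. The function name `split_interval` is hypothetical and not part of Croissant, and the sketch only handles the plain `start/end` form, not durations or repeating intervals.

```python
def split_interval(value: str) -> tuple[str, str]:
    """Split an ISO 8601 'start/end' interval into its two endpoints."""
    parts = value.split("/")
    # Exactly two non-empty parts are expected; anything else is rejected.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"not a start/end interval: {value!r}")
    return parts[0], parts[1]


start, end = split_interval("2020/2022")
```

Each endpoint is left as a string here; it could then be checked against a date or dateTime lexical form separately.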
[-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm] (see Chapter 5.4 of ISO 8601)."
Wow, the discussion in https://github.com/schemaorg/schemaorg/issues/1781 is quite amazing and instructive.
You make very valid points on the merits of xsd types vs. schema.org basic data types.
Personally, I would lean towards supporting both in Croissant, and specifying a clear mapping as you do in that discussion. If there is a consensus on the benefits, we can recommend using the xsd basic data types over the schema.org ones in the next version of Croissant.
In general, data typing in Croissant aims to be extensible, and not limited to a single namespace. For instance, users can "semantically" type their data by associating classes from schema.org, wikidata, or other vocabularies. That said, for basic data types we certainly want to favor consistency to reduce the burden on tools and users of the datasets.
As you noted, we do inherit some of the fuzziness of schema.org, but try to make things a bit more precise where necessary.
Regarding format, "holds" means "contains". cr:Format is just a marker type, but its values are still strings (err... I mean sc:Text. :-)
We are definitely going to need to differentiate between int8, int16, uint8... and xsd has short, long, unsignedLong, etc. So in that regard xsd seems useful indeed.
Looking at numpy types, is xsd enough though? What mechanism do we want to support to describe a field as being an int128, or a complex number for example?
@pierrot0 For that you'd need custom datatypes.
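To make the split concrete, here is a minimal Python sketch of which fixed-width numeric type names have an XSD datatype of the same width and which would fall through to a custom datatype. The table `NUMPY_TO_XSD` and the function `xsd_datatype` are hypothetical names for illustration, not something Croissant defines.

```python
from typing import Optional

XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# Fixed-width type names that have an XSD datatype of the same width.
NUMPY_TO_XSD = {
    "int8": "byte",
    "int16": "short",
    "int32": "int",
    "int64": "long",
    "uint8": "unsignedByte",
    "uint16": "unsignedShort",
    "uint32": "unsignedInt",
    "uint64": "unsignedLong",
    "float32": "float",
    "float64": "double",
}


def xsd_datatype(dtype_name: str) -> Optional[str]:
    """Return the XSD datatype IRI for a type name, or None if XSD has no match."""
    local_name = NUMPY_TO_XSD.get(dtype_name)
    return XSD_NS + local_name if local_name else None


# int128 and complex numbers have no XSD counterpart, so they return None
# and would need a custom datatype.
needs_custom = [n for n in ("int16", "int128", "complex128") if xsd_datatype(n) is None]
```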
How about large multidimensional arrays (tensors)? NetCDF and HDF5 for example have mechanisms for capturing such in binary and for describing them.
@pierrot0 I've reread the discussion above.
users can "semantically" type their data by associating classes from schema.org, wikidata, or other vocabularies.
What precisely do you mean by this, can you give an example?
cr:Format is just a marker type, but its values are still strings (err... I mean sc:Text. :-)
rdf:type
"application/json"^^cr:Format is a valid literal with a custom datatype. But do you really want that?
"2024-10-09"^^xsd:date tells it to index the literal as a date (so eg it should come before "12024-10-09")
"point(1 2)"^^geo:wktLiteral tells it to put a GeoSPARQL literal (expressed as Well Known Text) in a geospatial index
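The date example can be sketched in a few lines of Python: sorted as plain strings, "12024-10-09" comes first, while sorted as dates it comes last. The helper `date_key` is a hypothetical stand-in for what a datatype-aware index does, and it assumes non-negative years and no timezone suffix.

```python
def date_key(lexical: str) -> tuple[int, int, int]:
    """Turn a CCYY-MM-DD lexical form (year may exceed four digits) into a sortable tuple."""
    year, month, day = lexical.split("-")
    return int(year), int(month), int(day)


values = ["12024-10-09", "2024-10-09"]

# Plain string ordering compares character by character, so '1' sorts before '2'.
as_strings = sorted(values)

# Date-aware ordering compares the year numerically, so 2024 sorts before 12024.
as_dates = sorted(values, key=date_key)
```

The two orderings disagree, which is the practical difference between an untyped string and a literal that the store knows to be a date.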
Schema.org datatypes are not good:
xsd:date, xsd:decimal etc but not for schema:Date, schema:Number etc
https://github.com/schemaorg/schemaorg/issues/1781 explains in more detail what's wrong with them.
This also leads to confusion, eg in https://github.com/mlcommons/croissant/blob/main/docs/croissant.ttl:
cr:format, or does it point to a node with type cr:Format that holds the string?
This issue involves the ontologies and JSONLD context. Here's a count of occurrences in the two ontologies:
Also, I think it's better to distinguish properties between owl:DatatypeProperty and owl:ObjectProperty. Many Schema.org props are permissive and allow either literal or object ("string or thing"), but I think Croissant props are more precise,