Closed davaya closed 1 year ago
Although technically Datatype is a class, most tooling I've used in defining and maintaining RDF ontologies clearly distinguishes general "object classes" from Datatypes, so having a separate Datatype directory would be consistent with common practice.
In terms of the exact format, I did some experiments using Protege to define a new Datatype.
What I came up with was defining a Datatype with a comment annotation containing the pattern restriction and a definition using the built-in xsd:string. This results in the following Datatype definition in RDF/OWL/TTL format:
:test rdf:type rdfs:Datatype ;
rdfs:comment "Restricted by the pattern \".*\"" ;
owl:equivalentClass xsd:string .
So if a property can have type DateTime, it might as well also be able to have type String instead of xsd:string.
I'm not sure I understand the benefit of the above statement.
I believe we should use the built-in XSD datatype wherever possible - e.g. continue to use xsd:string.
The type name is a synonym for the built-in xsd type, so one can be translated to the other where it makes sense, as long as the translation xsd:string = String is defined somewhere. It could be hidden in software or made explicit in the model files; I prefer explicit.
One benefit is a consistent language-independent representation of the model - xsd is specific to XML while other formats (e.g., SQL, SDL, ASN.1, ...) have their own representations of types. If CreationInfo and Relationship and DateTime are all types in a model regardless of what languages it targets, it's just aesthetically more consistent (and more accessible to newcomers) for SemVer and String to have names in the same style.
But the main benefit and motivation is practical: types can be re-used. Giving SemVer a name allows it to be part of a library of things that don't have to be re-invented. Another profile that hasn't been thought of yet, or maybe even the AI profile, might have an application of semantic version numbers that doesn't involve the SPDX specVersion. A general practice of defining everything as a type means we don't have to predict whether it might be useful in more than one place, and if it is, updates only need to be made once. If it's never used again, defining it as a type costs nothing, so there is no reason not to do it.
I am against the idea that we create our own "language-independent representation of the model".
We have decided long ago that we use RDF for this model. The only point of the markdown is to make it accessible to people who do not want to deal with RDF intricacies.
Then strike my "language-independent" comment. We're left with:
Who could be against those?
@goneall since you are using Protege, Stack Overflow had the following datatype definition (for a number with range 0-100). Presumably the same rdfs:Datatype owl:withRestrictions structure works for strings with a pattern, though I'd think you could use xsd restrictions since that's what rdfs says ...
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="http://stackoverflow.com/q/24531940/1281433/percentages#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <owl:Ontology rdf:about="http://stackoverflow.com/q/24531940/1281433/percentages"/>
  <owl:DatatypeProperty rdf:about="http://stackoverflow.com/q/24531940/1281433/percentages#hasPercentage">
    <rdfs:range>
      <rdfs:Datatype>
        <owl:onDatatype rdf:resource="http://www.w3.org/2001/XMLSchema#double"/>
        <owl:withRestrictions rdf:parseType="Collection">
          <rdf:Description>
            <xsd:minInclusive rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</xsd:minInclusive>
          </rdf:Description>
          <rdf:Description>
            <xsd:maxInclusive rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">100</xsd:maxInclusive>
          </rdf:Description>
        </owl:withRestrictions>
      </rdfs:Datatype>
    </rdfs:range>
  </owl:DatatypeProperty>
</rdf:RDF>
@zvr:
We have decided long ago that we use RDF for this model.
There's no questioning that. The RDF model right now uses named datatypes for Vocabularies (SoftwarePurpose, AnnotationType, RelationshipType, etc). So it is obviously possible to restrict an xsd:string to be a named SubclassOf xsd:string.
If "the model" that is generated from the markdown files cannot do DateTime and SemVer in the identical way it does AnnotationType and RelationshipType, then that is a bug in the software, not a bug in "the RDF model" or "the markdown model". They are one and the same, they are both "the model". The software should have as a design requirement to generate one from the other, in both directions.
@davaya I do not understand your point at all. What does xsd:string have to do with Vocabularies?
In case it's not yet clear, the value of the relationshipType property is an object of class RelationshipType (or http://spdx.org/rdf/v3/Core/RelationshipType to give its full URI).
One such individual is http://spdx.org/rdf/v3/Core/RelationshipType/contains.
There are no strings anywhere.
There might be a serialization that uses a string like "CONTAINS", as the tag-value serialization did in SPDXv2, but this has nothing to do with the model.
Enumerations are a restriction on the xsd:string datatype. Patterns are a restriction on the xsd:string datatype. The lexical space of an instance of RelationshipType is a string such as "affects", "amends", "ancestor", "contains", etc. Only certain strings are valid instances of the RelationshipType class defined as an enumeration restriction on xsd:string, just as only certain strings are valid instances of the DateTime class defined by a pattern restriction on xsd:string. None of those Classes have properties; that is obvious from the fact that they are a SubclassOf a simple datatype (xsd:string), whether the Class markdowns are all in one directory or several.
XML Schema2 predefines about forty simple types, the ones suitable for RDF and OWL are listed in [RDF Semantics]. In addition, XML Schema permits users to refine these builtin types by taking a restriction including only some of the values or some of the lexical forms.
A datatype is understood to define a partial mapping, called the lexical-to-value mapping, from a lexical space (a set of character strings) to values. The function L2V maps datatypes to their lexical-to-value mapping. A literal with datatype d denotes the value obtained by applying this mapping to the character string sss: L2V(d)(sss).
@zvr says:
In case it's not yet clear, the value of the relationshipType property is an object of class RelationshipType (or http://spdx.org/rdf/v3/Core/RelationshipType to give its full URI).
Which points out the terminology issue: RelationshipType is a Class (even though the markdown puts it in a Vocabularies directory to separate it from the Classes that have properties), and it is a SubclassOf xsd:string which means that it does not have properties and does have the enumeration restriction on xsd:string. (A different model could have an enumeration of integers or dates instead of strings.)
DateTime works the same way, it is a SubclassOf xsd:string which means it doesn't have properties, it does have a pattern restriction, and it can go in another directory like Datatypes in the markdown files.
One such individual is http://spdx.org/rdf/v3/Core/RelationshipType/contains. There are no strings anywhere.
The logical value is http://spdx.org/rdf/v3/Core/RelationshipType/contains. The lexical form is "contains", a string. The mapping from lexical form to value is called the Lexical to Value Mapping, shown above.
The lexical form of "3.0.0" is a string. The logical value is defined by the L2V mapping L2V(d)(sss), or L2V(SemVer)("3.0.0"), or http://spdx.org/rdf/v3/Core/SemVer/3.0.0.
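To make the lexical-to-value idea concrete, here is a minimal Python sketch of a partial mapping in the sense described above. It follows this comment's reading that the logical value is an IRI under the Core namespace. The abbreviated RelationshipType list, the simplified SemVer pattern, and the names `LEXICAL_SPACE` and `l2v` are assumptions made for illustration, not part of the SPDX model or any tooling.

```python
import re
from typing import Callable, Dict, Optional

CORE = "http://spdx.org/rdf/v3/Core/"

# Lexical space of each named type: which strings are acceptable.
# The value list is abbreviated and the SemVer pattern is simplified (assumptions).
LEXICAL_SPACE: Dict[str, Callable[[str], bool]] = {
    "RelationshipType": lambda s: s in {"affects", "amends", "ancestor", "contains"},
    "SemVer": lambda s: re.fullmatch(r"\d+\.\d+\.\d+", s) is not None,
}


def l2v(datatype: str, lexical: str) -> Optional[str]:
    """Partial mapping: return a value only for strings in the lexical space."""
    if not LEXICAL_SPACE[datatype](lexical):
        return None  # the mapping is undefined for this string
    return f"{CORE}{datatype}/{lexical}"


print(l2v("RelationshipType", "contains"))
print(l2v("SemVer", "3.0.0"))
print(l2v("SemVer", "three"))
```

Whether the type is restricted by enumeration or by pattern, the sketch treats it the same way: a named set of acceptable strings plus a mapping to a value.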
But my main point is that there is no difference in implementation - the model has to define something that is an instance of xsd:string, and it has to be a restriction (enumeration or pattern or length, etc) on xsd:string. Giving it a TypeName doesn't change that fact, it just means that you don't have to repeat the definition of the restriction or ponder about how often it might be used, you define it once and refer to it using one or more property names.
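The "define it once, refer to it by name" argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NamedType`, `PROPERTY_TYPES`, and `validate` are hypothetical names, the SemVer pattern is simplified, and `toolVersion` is an invented property used to show a second use of the same type.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedType:
    """A named pattern restriction on a base datatype."""
    name: str
    base: str
    pattern: str

    def accepts(self, lexical: str) -> bool:
        return re.fullmatch(self.pattern, lexical) is not None


# The restriction is written exactly once (simplified pattern, an assumption).
SEMVER = NamedType(name="SemVer", base="xsd:string", pattern=r"\d+\.\d+\.\d+")

# Properties refer to the type by name; "toolVersion" is hypothetical.
PROPERTY_TYPES = {
    "specVersion": SEMVER,
    "toolVersion": SEMVER,
}


def validate(prop: str, lexical: str) -> bool:
    """Check a property value against the named type assigned to that property."""
    return PROPERTY_TYPES[prop].accepts(lexical)


print(validate("specVersion", "3.0.0"))
print(validate("toolVersion", "1.2"))
```

If the pattern ever needs tightening, only the `SEMVER` definition changes; every property that refers to it picks up the update.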
You already give all enumeration Subclasses of xsd:string a name, every one of them, whether they are used only once or more than once; nobody wastes time arguing about whether to name them or not. Just do the same for pattern subclasses, like the model already does for DateTime, MediaType, and SemVer, and add SpdxId to them - it's used more than once.
@davaya Is this now resolved with the introduction of the DataType directory?
The existence of Datatypes and Vocabularies directories is fine, if the problem was the inability of software to properly process the markdown files without those directories. (Strictly speaking, following this precedent, there would need to be both SimpleDatatypes (for DateTime, MediaType, etc classes) and Datatypes (for CreationInfo, PositiveIntegerRange, etc classes), but I'm not advocating doing that.)
Classes, simple Datatypes, compound Datatypes, and Vocabularies are all Classes, but whatever markdown directories they are placed in is OK. Model parsers should check that compound Datatypes do not have a SubclassOf chain with Element as the root (i.e. that compound Datatypes are not Elements and aren't identified by SpdxId), but since they are currently in the Classes directory, their non-Element subclass tree is what distinguishes Element classes from compound Datatype classes.
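The parser check described above could look roughly like the following Python sketch. The `SUBCLASS_OF` map is a hand-written, hypothetical excerpt rather than data read from the markdown files, and `root_of` and `is_element` are illustrative names, not functions from any existing model parser.

```python
from typing import Dict, Optional

# Hypothetical excerpt: class name -> its SubclassOf target (None for a root).
SUBCLASS_OF: Dict[str, Optional[str]] = {
    "Element": None,
    "Artifact": "Element",
    "Relationship": "Element",
    "CreationInfo": None,
    "PositiveIntegerRange": None,
}


def root_of(cls: str, parents: Dict[str, Optional[str]]) -> str:
    """Follow the SubclassOf chain upward and return its root class."""
    seen = {cls}
    while (parent := parents.get(cls)) is not None:
        if parent in seen:
            raise ValueError(f"SubclassOf cycle at {parent}")
        seen.add(parent)
        cls = parent
    return cls


def is_element(cls: str, parents: Dict[str, Optional[str]] = SUBCLASS_OF) -> bool:
    """A class is an Element class only if its chain is rooted at Element."""
    return root_of(cls, parents) == "Element"


for name in SUBCLASS_OF:
    kind = "Element class" if is_element(name) else "compound Datatype"
    print(f"{name}: {kind}")
```

With a check like this, the distinction between Element classes and compound Datatypes comes from the subclass tree itself, independent of which markdown directory a class lives in.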
My goal is to stop the nitpicking over terminology. "SubclassOf" means "inherit from" whether it is a simple datatype restricted from another simple datatype, an Element inherited from another Element, or a compound Datatype inherited from another compound Datatype.
TL;DR: it's resolved, with or without the Datatypes directory, if we stop arguing about the meaning of SubclassOf.
It's resolved now that we have accepted that DateTime, MediaType, and SemVer can have names without arguing about how many times they are used.
And it's resolved since we still have PR #407 to answer the question raised in Issue #36. That question is as simple to resolve as the DateTime question - does giving SpdxId a name make the model easier to understand, and does it cause any harm. Arguing about whether giving names to DateTime or SpdxId is "necessary" is a waste of time - they aren't "necessary", but the names are helpful and not harmful.
@davaya If I understand your comment above, we should be able to close this issue and leave PR #407 open, since 407 is more specific and actionable. We can always refer back to this issue and/or create another PR with a specific proposal.
As a follow-up to the July 14 serialization discussion of #368, RDFS says:
RDF Concepts says:
and XML Schema says:
length minLength maxLength pattern enumeration assertions
What that says to me is that xsd:string is a Datatype, which is a Class, and that new types derived from it using model-defined facets are also Classes identified by name.
The markdown files can have a Datatypes directory separate from the Classes directory to hold pattern-restricted classes, just as the separate Vocabularies directory holds enumeration-restricted classes. For uniformity of naming the model should define datatype names for the built-in primitive datatypes that we use. Conceptually there is no difference. So if a property can have type DateTime, it might as well also be able to have type String instead of xsd:string.