spdx / spdx-3-model

The model for the information captured in SPDX version 3 standard.
https://spdx.dev/use/specifications/

RDF Classes #423

Closed davaya closed 1 year ago

davaya commented 1 year ago

As a follow-up to the July 14 serialization discussion of #368, RDFS says:

2.2 rdfs:Class
This is the class of resources that are RDF classes. rdfs:Class is an instance of rdfs:Class.

2.3 rdfs:Literal
The class rdfs:Literal is the class of literal values such as strings and integers. Property values such as textual strings are examples of RDF literals. rdfs:Literal is an instance of rdfs:Class. rdfs:Literal is a subclass of rdfs:Resource.

2.4 rdfs:Datatype
rdfs:Datatype is the class of datatypes. All instances of rdfs:Datatype correspond to the RDF model of a datatype described in the RDF Concepts specification [RDF11-CONCEPTS]. rdfs:Datatype is both an instance of and a subclass of rdfs:Class. Each instance of rdfs:Datatype is a subclass of rdfs:Literal.

RDF Concepts says:

5. Datatypes
Datatypes are used with RDF literals to represent values such as strings, numbers and dates. The datatype abstraction used in RDF is compatible with XML Schema [XMLSCHEMA11-2].

and XML Schema says:

3.3.1.3 Facets
The string datatype has the following ·constraining facets· with the values shown; these facets may be specified in the derivation of new types, if the value given is at least as restrictive as the one shown:

whiteSpace = preserve

Datatypes derived by restriction from string may also specify values for the following ·constraining facets·:

length, minLength, maxLength, pattern, enumeration, assertions


What that says to me is that xsd:string is a Datatype, which is a Class, and that new types derived from it using model-defined facets are also Classes identified by name.

The markdown files can have a Datatypes directory separate from the Classes directory to hold pattern-restricted classes, just as the separate Vocabularies directory holds enumeration-restricted classes. For uniformity of naming the model should define datatype names for the built-in primitive datatypes that we use. Conceptually there is no difference:

The ·built-in· ·constructed· datatypes are those which are believed to be so common that if they were not defined in this specification many schema designers would end up reinventing them.

So if a property can have type DateTime, it might as well also be able to have type String instead of xsd:string.

goneall commented 1 year ago

Although technically Datatype is a class, most tooling I've used in defining and maintaining RDF ontologies clearly distinguishes general "object classes" from Datatypes, so having a separate Datatype directory would be consistent with common practice.

In terms of the exact format, I did some experiments using Protege to define a new Datatype.

What I came up with was defining a Datatype with a comment annotation containing the pattern restriction and a definition using the built-in xsd:string. This results in the following Datatype definition in RDF/OWL/TTL format:

:test rdf:type rdfs:Datatype ;
      rdfs:comment "Restricted by the pattern \".*\"" ;
      owl:equivalentClass xsd:string .
goneall commented 1 year ago

So if a property can have type DateTime, it might as well also be able to have type String instead of xsd:string.

I'm not sure I understand the benefit of the above statement.

I believe we should use the built-in XSD datatype wherever possible - e.g. continue to use xsd:string.

davaya commented 1 year ago

The type name is a synonym for the built-in xsd type, so one can be translated to the other where it makes sense, as long as the translation xsd:string = String is defined somewhere. It could be hidden in software or made explicit in the model files; I prefer explicit.

One benefit is a consistent language-independent representation of the model - xsd is specific to XML while other formats (e.g., SQL, SDL, ASN.1, ...) have their own representations of types. If CreationInfo and Relationship and DateTime are all types in a model regardless of what languages it targets, it's just aesthetically more consistent (and more accessible to newcomers) for SemVer and String to have names in the same style.

But the main benefit and motivation is practical: types can be re-used. Giving SemVer a name allows it to be part of a library of things that don't have to be re-invented. Another profile that hasn't been thought of yet, or maybe even the AI profile, might have an application of semantic version numbers that doesn't involve the SPDX specVersion. A general practice of defining everything as a type means we don't have to predict whether it might be useful in more than one place, and if it is, updates only need to be made once. If it's never used again, defining it as a type costs nothing, so there is no reason not to do it.
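A minimal Python sketch of the "define once, refer by name" idea described above. It is not how the SPDX model tooling works. The SemVer pattern is simplified, and the `modelVersion` property and the registry layout are assumptions made up for illustration.

```python
import re

# Named simple datatypes, each defined exactly once.
# The SemVer pattern is a simplified stand-in, not the real SPDX pattern.
DATATYPES = {
    "String": {"base": "xsd:string"},  # plain synonym for the built-in type
    "SemVer": {"base": "xsd:string", "pattern": r"[0-9]+\.[0-9]+\.[0-9]+"},
}

# Properties refer to a datatype by name. "specVersion" comes from the thread;
# "modelVersion" is a hypothetical second user of the same type.
PROPERTIES = {
    "specVersion": "SemVer",
    "modelVersion": "SemVer",
    "comment": "String",
}


def is_valid(prop: str, lexical: str) -> bool:
    """Check a lexical form against the datatype named by the property."""
    datatype = DATATYPES[PROPERTIES[prop]]
    pattern = datatype.get("pattern")
    # XSD patterns match the whole string, hence fullmatch.
    return pattern is None or re.fullmatch(pattern, lexical) is not None


def to_xsd(prop: str) -> str:
    """Translate the model type name back to its built-in base type."""
    return DATATYPES[PROPERTIES[prop]]["base"]
```

Tightening the SemVer pattern later would mean editing one registry entry, and both properties would pick up the change.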

zvr commented 1 year ago

I am against the idea that we create our own "language-independent representation of the model".

We have decided long ago that we use RDF for this model. The only point of the markdown is to make it accessible to people who do not want to deal with RDF intricacies.

davaya commented 1 year ago

Then strike my "language-independent" comment. We're left with:

Who could be against those?

davaya commented 1 year ago

@goneall since you are using Protege, Stack Overflow had the following datatype definition (for a number with range 0-100). Presumably the same rdfs:Datatype owl:withRestrictions structure works for strings with a pattern, though I'd think you could use xsd restrictions since that's what rdfs says ...

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="http://stackoverflow.com/q/24531940/1281433/percentages#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <owl:Ontology rdf:about="http://stackoverflow.com/q/24531940/1281433/percentages"/>
  <owl:DatatypeProperty rdf:about="http://stackoverflow.com/q/24531940/1281433/percentages#hasPercentage">
    <rdfs:range>
      <rdfs:Datatype>
        <owl:onDatatype rdf:resource="http://www.w3.org/2001/XMLSchema#double"/>
        <owl:withRestrictions rdf:parseType="Collection">
          <rdf:Description>
            <xsd:minInclusive rdf:datatype="http://www.w3.org/2001/XMLSchema#integer"
            >0</xsd:minInclusive>
          </rdf:Description>
          <rdf:Description>
            <xsd:maxInclusive rdf:datatype="http://www.w3.org/2001/XMLSchema#integer"
            >100</xsd:maxInclusive>
          </rdf:Description>
        </owl:withRestrictions>
      </rdfs:Datatype>
    </rdfs:range>
  </owl:DatatypeProperty>
</rdf:RDF>
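To make the structure of that snippet concrete, here is a rough Python sketch that reads the base datatype and the facets out of a trimmed copy of the rdfs:Datatype element. It uses only the standard-library XML parser, so it treats the input as plain XML and not as an RDF graph. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Trimmed copy of the rdfs:Datatype element from the snippet above.
SNIPPET = f"""
<rdfs:Datatype xmlns:rdf="{RDF}" xmlns:owl="{OWL}"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="{XSD}">
  <owl:onDatatype rdf:resource="{XSD}double"/>
  <owl:withRestrictions rdf:parseType="Collection">
    <rdf:Description>
      <xsd:minInclusive rdf:datatype="{XSD}integer">0</xsd:minInclusive>
    </rdf:Description>
    <rdf:Description>
      <xsd:maxInclusive rdf:datatype="{XSD}integer">100</xsd:maxInclusive>
    </rdf:Description>
  </owl:withRestrictions>
</rdfs:Datatype>
"""


def read_restrictions(xml_text: str):
    """Return (base datatype IRI, {facet name: lexical value})."""
    root = ET.fromstring(xml_text.strip())
    base = root.find(f"{{{OWL}}}onDatatype").get(f"{{{RDF}}}resource")
    facets = {}
    for description in root.findall(f"{{{OWL}}}withRestrictions/{{{RDF}}}Description"):
        for facet in description:
            # Tags look like "{namespace}minInclusive"; keep the local name.
            facets[facet.tag.split("}", 1)[1]] = facet.text.strip()
    return base, facets


def in_range(value: float, facets: dict) -> bool:
    """Apply the minInclusive/maxInclusive facets to a numeric value."""
    return float(facets["minInclusive"]) <= value <= float(facets["maxInclusive"])
```

A pattern-restricted string would presumably have the same shape, with xsd:string as the base and a single xsd:pattern facet in the collection.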
davaya commented 1 year ago

@zvr:

We have decided long ago that we use RDF for this model.

There's no questioning that. The RDF model right now uses named datatypes for Vocabularies (SoftwarePurpose, AnnotationType, RelationshipType, etc). So it is obviously possible to restrict an xsd:string to be a named SubclassOf xsd:string.

If "the model" that is generated from the markdown files cannot do DateTime and SemVer in the identical way it does AnnotationType and RelationshipType, then that is a bug in the software, not a bug in "the RDF model" or "the markdown model". They are one and the same, they are both "the model". The software should have as a design requirement to generate one from the other, in both directions.

zvr commented 1 year ago

@davaya I do not understand your point at all. What does xsd:string have to do with Vocabularies?

In case it's not yet clear, the value of the relationshipType property is an object of class RelationshipType (or http://spdx.org/rdf/v3/Core/RelationshipType to give its full URI).

One such individual is http://spdx.org/rdf/v3/Core/RelationshipType/contains. There are no strings anywhere.

There might be a serialization that uses a string like "CONTAINS", as the tag-value serialization did in SPDXv2, but this has nothing to do with the model.

davaya commented 1 year ago

Enumerations are a restriction on the xsd:string datatype. Patterns are a restriction on the xsd:string datatype. The lexical space of an instance of RelationshipType is a string such as "affects", "amends", "ancestor", "contains", etc. Only certain strings are valid instances of the RelationshipType class defined as an enumeration restriction on xsd:string, just as only certain strings are valid instances of the DateTime class defined by a pattern restriction on xsd:string. None of those Classes have properties; that is obvious from the fact that they are a SubclassOf a simple datatype (xsd:string), whether or not the Class markdowns are all in one directory or several.

User Defined Datatypes:

XML Schema2 predefines about forty simple types, the ones suitable for RDF and OWL are listed in [RDF Semantics]. In addition, XML Schema permits users to refine these builtin types by taking a restriction including only some of the values or some of the lexical forms.

Lexical to Value Mapping:

A datatype is understood to define a partial mapping, called the lexical-to-value mapping, from a lexical space (a set of character strings) to values. The function L2V maps datatypes to their lexical-to-value mapping. A literal with datatype d denotes the value obtained by applying this mapping to the character string sss: L2V(d)(sss).

@zvr says:

In case it's not yet clear, the value of the relationshipType property is an object of class RelationshipType (or http://spdx.org/rdf/v3/Core/RelationshipType to give its full URI).

Which points out the terminology issue: RelationshipType is a Class (even though the markdown puts it in a Vocabularies directory to separate it from the Classes that have properties), and it is a SubclassOf xsd:string which means that it does not have properties and does have the enumeration restriction on xsd:string. (A different model could have an enumeration of integers or dates instead of strings.)

DateTime works the same way, it is a SubclassOf xsd:string which means it doesn't have properties, it does have a pattern restriction, and it can go in another directory like Datatypes in the markdown files.

One such individual is http://spdx.org/rdf/v3/Core/RelationshipType/contains. There are no strings anywhere.

The logical value is http://spdx.org/rdf/v3/Core/RelationshipType/contains. The lexical form is "contains", a string. The mapping from lexical form to value is called the Lexical to Value Mapping, shown above.

The lexical form of "3.0.0" is a string. The logical value is defined by the L2V mapping L2V(d)(sss), or L2V(SemVer)("3.0.0"), or http://spdx.org/rdf/v3/Core/SemVer/3.0.0.
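The mapping as described in this comment can be sketched in a few lines of Python. This follows the reading given here (lexical form in, IRI out), which is one participant's view in the thread and not an agreed part of the model. The enumeration is only the subset of values quoted above, the SemVer pattern is simplified, and `l2v` is a hypothetical helper.

```python
import re
from typing import Optional

CORE = "http://spdx.org/rdf/v3/Core/"

# Illustrative facets only: a subset of RelationshipType values and a
# simplified SemVer pattern, both assumptions for this sketch.
DATATYPES = {
    "RelationshipType": {"enumeration": {"affects", "amends", "ancestor", "contains"}},
    "SemVer": {"pattern": r"[0-9]+\.[0-9]+\.[0-9]+"},
}


def l2v(datatype: str, lexical: str) -> Optional[str]:
    """Partial lexical-to-value mapping: None when the string is outside the lexical space."""
    facets = DATATYPES[datatype]
    if "enumeration" in facets and lexical not in facets["enumeration"]:
        return None
    if "pattern" in facets and re.fullmatch(facets["pattern"], lexical) is None:
        return None
    return f"{CORE}{datatype}/{lexical}"
```

The mapping is partial: a string such as "CONTAINS" is not in the lexical space of this RelationshipType sketch, so it maps to nothing.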

But my main point is that there is no difference in implementation - the model has to define something that is an instance of xsd:string, and it has to be a restriction (enumeration or pattern or length, etc) on xsd:string. Giving it a TypeName doesn't change that fact, it just means that you don't have to repeat the definition of the restriction or ponder about how often it might be used, you define it once and refer to it using one or more property names.

You already give all enumeration Subclasses of xsd:string a name, every one of them whether they are used only once or more than once, and nobody wastes time arguing about whether to name them or not. Just do the same for pattern subclasses, like the model already does for DateTime, MediaType, and SemVer, and add SpdxId to them, since it's used more than once.

goneall commented 1 year ago

@davaya Is this now resolved with the introduction of the DataType directory?

davaya commented 1 year ago

The existence of Datatypes and Vocabularies directories is fine, if the problem was the inability of software to properly process the markdown files without those directories. (Strictly speaking, following this precedent, there would need to be both SimpleDatatypes (for DateTime, MediaType, etc classes) and Datatypes (for CreationInfo, PositiveIntegerRange, etc classes), but I'm not advocating doing that.)

Classes, simple Datatypes, compound Datatypes, and Vocabularies are all Classes, but whatever markdown directories they are placed in is OK. Model parsers should check that compound Datatypes do not have a SubclassOf chain with Element as the root (i.e. that compound Datatypes are not Elements and aren't identified by SpdxId), but since they are currently in the Classes directory, their non-Element subclass tree is what distinguishes Element classes from compound Datatype classes.
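The parser check suggested above could look roughly like this in Python. The SubclassOf table is a hypothetical excerpt written by hand for the sketch, not data read from the markdown files, and the function names are made up.

```python
# Hypothetical excerpt of a SubclassOf table: class name -> parent name,
# with None for a class that has no parent.
SUBCLASS_OF = {
    "Element": None,
    "Artifact": "Element",
    "Relationship": "Element",
    "CreationInfo": None,
    "PositiveIntegerRange": None,
}


def root_of(name: str, subclass_of: dict) -> str:
    """Follow the SubclassOf chain up to its root, guarding against cycles."""
    seen = set()
    while subclass_of[name] is not None:
        if name in seen:
            raise ValueError(f"cycle in SubclassOf chain at {name}")
        seen.add(name)
        name = subclass_of[name]
    return name


def is_element(name: str, subclass_of: dict = SUBCLASS_OF) -> bool:
    """A class is an Element class when its chain is rooted at Element."""
    return root_of(name, subclass_of) == "Element"


def misplaced_compound_datatypes(names, subclass_of: dict = SUBCLASS_OF):
    """Return the names declared as compound Datatypes that are really Elements."""
    return [name for name in names if is_element(name, subclass_of)]
```

An empty result means every declared compound Datatype sits outside the Element tree, which is the property the comment asks model parsers to verify.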

My goal is to stop the nitpicking over terminology. "SubclassOf" means "inherit from" whether it is a simple datatype restricted from another simple datatype, an Element inherited from another Element, or a compound Datatype inherited from another compound Datatype.

TL;DR: it's resolved, with or without the Datatypes directory, if we stop arguing about the meaning of SubclassOf. It's resolved now that we have accepted that DateTime, MediaType, and SemVer can have names without arguing about how many times they are used. And it's resolved since we still have PR #407 to answer the question raised in Issue #36. That question is as simple to resolve as the DateTime question: does giving SpdxId a name make the model easier to understand, and does it cause any harm? Arguing about whether giving names to DateTime or SpdxId is "necessary" is a waste of time. They aren't "necessary", but the names are helpful and not harmful.

goneall commented 1 year ago

@davaya If I understand your comment above, we should be able to close this issue and leave PR #407 open, since 407 is more specific and actionable. We can always refer back to this issue and/or create another PR with a specific proposal.