How to differentiate the use of a classification term between two objects once loaded into a graph?

beaudet commented 2 years ago

I feel like I'm overlooking something pretty basic with this question (possibly some nuance of SHACL validation), but if I have the following term applied to an art object as follows:

  "id": "http://vocab.getty.edu/aat/300033618",
  "type": "Type",
  "_label": "Painting",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300435443",
      "type": "Type",
      "_label": "Type of Work"
    }
  ]

and one of the parts of that object (or, for that matter, another object in the graph) defines another sub-classification for the same term, e.g.

  "id": "http://vocab.getty.edu/aat/300033618",
  "type": "Type",
  "_label": "Painting",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300053001",
      "type": "Type",
      "_label": "Process and Techniques"
    }
  ]

How does one distinguish these differences in a knowledge graph which would presumably accumulate all of the classifications for the term under the node for the term rather than with the object where the sub-classifications are defined?

In other words, after loading the JSON-LD, there's now a node in the graph with URI

http://vocab.getty.edu/aat/300033618

which has two sub-classifications (Type of Work + Process & Techniques) hanging off of it.

With SHACL validation, so long as ONE of the two uses of the term defines "Type of Work", no warning is issued since the required sub-classification for the object part is already present in the graph since it was defined by the parent.

In general, if the IDs of terms are used directly in the graph rather than blank nodes, how does one go about differentiating between use of a term in one object rather than another?

Perhaps the most obvious way to visualize this is with the Turtle output of the graph, but it's even apparent in the RDF, in the sense that a duplicate sub-classification for a term isn't repeated. The location of the statements is clearer in the JSON-LD, but the nested classifications presumably cannot be round-tripped into a graph and then back out to JSON-LD. So, I guess the question is whether this is a problem or not. It certainly seems like a problem, because we lose the contextual importance of the sub-classification term: it's no longer tied to the object, but to the term itself, which is then shared across potentially many objects.
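A minimal sketch of the accumulation described above, using plain Python tuples in place of a real triplestore. The `flatten` and `metatypes_via_terms` helpers are hypothetical names introduced for illustration, and the JSON key `classified_as` is kept as the predicate instead of being expanded to `crm:P2_has_type`.

```python
PARENT = "https://linked.art/example/object/parent"
CHILD = "https://linked.art/example/object/child"
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300053001"


def flatten(node, triples):
    """Add one (subject, predicate, object) triple per classified_as entry, recursively."""
    for child in node.get("classified_as", []):
        triples.add((node["id"], "classified_as", child["id"]))
        flatten(child, triples)
    return triples


def metatypes_via_terms(obj_id, triples):
    """Metatypes reachable from an object through its classification terms."""
    terms = {o for s, p, o in triples if s == obj_id}
    return {o for s, p, o in triples if s in terms}


parent = {"id": PARENT, "classified_as": [
    {"id": PAINTING, "classified_as": [{"id": TYPE_OF_WORK}]}]}
child = {"id": CHILD, "classified_as": [
    {"id": PAINTING, "classified_as": [{"id": TECHNIQUE}]}]}

# Loading both records into one set mimics loading them into one graph.
graph = set()
flatten(parent, graph)
flatten(child, graph)

# The child never asserted "Type of Work", yet a graph-level check finds it.
child_metatypes = metatypes_via_terms(CHILD, graph)
```

In this sketch `child_metatypes` contains both metatypes, which is why a shape that only requires "Type of Work" somewhere on the term would pass for the child as well.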

beaudet commented 2 years ago

Here is an example graph in JSON-LD, with the corresponding Turtle serialization showing how the data accumulates in the graph, i.e.

<http://vocab.getty.edu/aat/300033618> rdf:type crm:E55_Type ; rdfs:label "Painting" ; crm:P2_has_type <http://vocab.getty.edu/aat/300138075> , <http://vocab.getty.edu/aat/300435443> .

SOURCE JSON-LD

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "https://linked.art/example/object/parent",
  "type": "HumanMadeObject",
  "_label": "Example Painting",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300033618",
      "type": "Type",
      "_label": "Painting",
      "classified_as": [
        {
            "id": "http://vocab.getty.edu/aat/300435443",
            "type": "Type",
            "_label": "Type of Work"
         }
      ]
    }
  ],
  "part": [
    {
      "type": "HumanMadeObject",
      "id": "https://linked.art/example/object/child",
      "_label": "Example Painting using Painting term used in the context of a technique",
      "classified_as": [
        {
          "id": "http://vocab.getty.edu/aat/300033618",
          "type": "Type",
          "_label": "Painting",
          "classified_as": [
            {
              "id": "http://vocab.getty.edu/aat/300138075",
              "type": "Type",
              "_label": "Processes and Techniques"
            }
          ]
        }
      ]
    }
  ]
}

AS TURTLE

@prefix crm:  <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<http://vocab.getty.edu/aat/300033618>
        rdf:type         crm:E55_Type ;
        rdfs:label       "Painting" ;
        crm:P2_has_type  <http://vocab.getty.edu/aat/300138075> , <http://vocab.getty.edu/aat/300435443> .

<https://linked.art/example/object/child>
        rdf:type         crm:E22_Human-Made_Object ;
        rdfs:label       "Example Painting using Painting term used in the context of a technique" ;
        crm:P2_has_type  <http://vocab.getty.edu/aat/300033618> .

<https://linked.art/example/object/parent>
        rdf:type                crm:E22_Human-Made_Object ;
        rdfs:label              "Example Painting" ;
        crm:P2_has_type         <http://vocab.getty.edu/aat/300033618> ;
        crm:P46_is_composed_of  <https://linked.art/example/object/child> .

<http://vocab.getty.edu/aat/300435443>
        rdf:type    crm:E55_Type ;
        rdfs:label  "Type of Work" .

<http://vocab.getty.edu/aat/300138075>
        rdf:type    crm:E55_Type ;
        rdfs:label  "Processes and Techniques" .

AS RDF

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:crm="http://www.cidoc-crm.org/cidoc-crm/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <crm:E22_Human-Made_Object rdf:about="https://linked.art/example/object/parent">
    <rdfs:label>Example Painting</rdfs:label>
    <crm:P46_is_composed_of>
      <crm:E22_Human-Made_Object rdf:about="https://linked.art/example/object/child">
        <rdfs:label>Example Painting using Painting term used in the context of a technique</rdfs:label>
        <crm:P2_has_type>
          <crm:E55_Type rdf:about="http://vocab.getty.edu/aat/300033618">
            <rdfs:label>Painting</rdfs:label>
            <crm:P2_has_type>
              <crm:E55_Type rdf:about="http://vocab.getty.edu/aat/300138075">
                <rdfs:label>Processes and Techniques</rdfs:label>
              </crm:E55_Type>
            </crm:P2_has_type>
            <crm:P2_has_type>
              <crm:E55_Type rdf:about="http://vocab.getty.edu/aat/300435443">
                <rdfs:label>Type of Work</rdfs:label>
              </crm:E55_Type>
            </crm:P2_has_type>
          </crm:E55_Type>
        </crm:P2_has_type>
      </crm:E22_Human-Made_Object>
    </crm:P46_is_composed_of>
    <crm:P2_has_type rdf:resource="http://vocab.getty.edu/aat/300033618"/>
  </crm:E22_Human-Made_Object>
</rdf:RDF>

beaudet commented 2 years ago

Once we have member_of collections, the meta-type assertion should be used to determine which collection the term should exist in, and validation should check that collection. For example, a "type of work" metatype means: look in the collection of "types of work" to see if the parent term exists there.
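A rough sketch of that idea, assuming collection membership is available as a simple lookup. The `COLLECTIONS` table and its contents are made up for illustration and do not reflect real AAT or Linked Art collection data.

```python
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300138075"

# Hypothetical membership data: metatype URI -> terms that are members of the
# collection that metatype stands for.
COLLECTIONS = {
    TYPE_OF_WORK: {PAINTING},
    TECHNIQUE: set(),  # the object-type term is deliberately not a member here
}


def validate_use(term, metatype, collections=COLLECTIONS):
    """Return True when the term is a member of the collection named by the metatype."""
    return term in collections.get(metatype, set())


ok_as_type_of_work = validate_use(PAINTING, TYPE_OF_WORK)
ok_as_technique = validate_use(PAINTING, TECHNIQUE)
```

The check depends only on the pair being asserted, not on whatever other metatypes have accumulated on the term node elsewhere in the graph.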

beaudet commented 2 years ago

related to #419

azaroth42 commented 1 year ago

Yeah, ... don't do that :) Either painting should be the object type, or the technique, but not both.

beaudet commented 10 months ago

Acknowledged, but regardless of the motivation, it seems likely that a classified_as term will be metatyped multiple times, and that breaks the ability to round-trip the data in a way that reconstitutes only the metatypes applicable to the object where those metatypes are used. In other words, the context where the metatype is being applied isn't really on the term itself, but on the application of that term to the Linked Art entity. I think we discussed this on the call, but it almost seems like each use of classified_as involving metatypes should mint a new identifier to capture the application of the term to the Linked Art entity, so that metatypes are scoped properly. Right now, it's a gray area. There can be contradictory metatypes that are valid data representations but make no logical sense, and the metatypes aggregated for each term would, I imagine, be duplicated everywhere the term is used when the graph is exploded back into JSON.
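To illustrate the round-trip concern, here is a small sketch that re-nests JSON from an already merged set of triples. The `rebuild` helper is a hypothetical stand-in for whatever framing step turns the graph back into nested records.

```python
PARENT = "https://linked.art/example/object/parent"
CHILD = "https://linked.art/example/object/child"
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300138075"

# The merged graph: both metatypes now hang off the shared term node.
merged = {
    (PARENT, "classified_as", PAINTING),
    (CHILD, "classified_as", PAINTING),
    (PAINTING, "classified_as", TYPE_OF_WORK),
    (PAINTING, "classified_as", TECHNIQUE),
}


def rebuild(obj_id, triples):
    """Re-nest an object's classifications and their metatypes from flat triples."""
    def targets(subject):
        return sorted(o for s, p, o in triples
                      if s == subject and p == "classified_as")

    return {
        "id": obj_id,
        "classified_as": [
            {"id": term, "classified_as": [{"id": m} for m in targets(term)]}
            for term in targets(obj_id)
        ],
    }


parent_out = rebuild(PARENT, merged)
child_out = rebuild(CHILD, merged)
```

Both rebuilt records carry both metatypes on "Painting", so the original per-object scoping cannot be recovered from the merged graph alone.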

azaroth42 commented 10 months ago

The same problem exists anywhere we use vocabulary or any other sort of shared identities with a more specific ontology. You say that the museum is a Place, I say that it's a Group, and someone else is thinking of the physical building... if you aggregate all three classes associated with the same URI into a single graph, you have a huge mess ... but if the URI is ambiguous as to which of those it describes, none of us are individually wrong.

As far as I can tell, the possible solutions are worse than the symptoms:

1. Manage our own vocabulary and/or make decisions about how existing terms must be used in the context of Linked Art. For the original example, we could declare that aat:300435443 must only be used with classified_as on a HumanMadeObject, and that aat:300053001 must only be used as the technique (per #561) of an activity. In aggregate that would be an enormous task as we'd be rebuilding AAT.
2. Come up with new relationships for every potential reuse of a term, e.g. nationality, gender, type_of_work, type_of_part, style, genre, shape, etc. would all need new relationships. This is why we went to metatypes in the first place.
3. Use attribute assignments everywhere to avoid direct assertions. As with any reification approach, it adds a lot of overhead and complexity.
4. Named Graphs. But then we're replacing one technology problem (triplestores) with another (quadstores).

Creating a new entity every time seems even worse, as there's no way to know where in the big wide world people have used and misused these URIs, and when that data might come into the triplestore. Especially as it's not just metatypes, but basically any assertion. Consider the question about _label (#539) and the type of a resource in my opening :(

We can of course discuss possible solutions, but I don't know of any that don't have huge overheads.
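For comparison, the named-graph option mentioned above can be sketched with quads, where a fourth element records which record a statement came from. Naming the graph after the asserting object is an assumption made for this illustration, and `metatypes_in` is a hypothetical helper.

```python
PARENT = "https://linked.art/example/object/parent"
CHILD = "https://linked.art/example/object/child"
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300138075"

# (subject, predicate, object, graph): the graph name is the object whose
# record contained the statement (an assumption of this sketch).
quads = {
    (PARENT, "classified_as", PAINTING, PARENT),
    (PAINTING, "classified_as", TYPE_OF_WORK, PARENT),
    (CHILD, "classified_as", PAINTING, CHILD),
    (PAINTING, "classified_as", TECHNIQUE, CHILD),
}


def metatypes_in(graph_name, term, quads):
    """Metatypes asserted for a term within a single named graph."""
    return {o for s, p, o, g in quads if g == graph_name and s == term}


parent_scope = metatypes_in(PARENT, PAINTING, quads)
child_scope = metatypes_in(CHILD, PAINTING, quads)
```

The scoping survives, but every query now has to carry the graph name, which is the quadstore overhead referred to above.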

beaudet commented 10 months ago

I think we should talk about possible solutions to maintain the integrity of the graph to be able to round-trip it to RDF. That almost seems like a basic expectation to me given that one of the first things data consumers will probably do is load it into a graph database and run SPARQL against it. I'm not speaking from a place of much experience at all on this, but it doesn't quite feel right to me that meta-statements made in the context of a term's relationship to a single entity would (not so obviously) bleed into any entity using that term. I could understand if the use of metatypes is clearly and intentionally about making global statements about a term but I don't think that's the case.

In other words, any linked art entity can be classified with terms, but rather than a direct classification, there's a node that encapsulates two relationships. So, to me at least, it seems like (3) is the most preferable / viable solution.

HumanMadeObject -> attributeAssignment -> [ classified_as -> term URI, usage_context -> term URI ] ??

Maybe a pattern like this could be mandatory when specifying the term's usage context (aka metatype), but if there's no context to convey, it falls back on the standard pattern (if there's an elegant way to do that).
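A minimal sketch of that pattern with its fallback, again using tuples as triples. The predicate names `attribute_assignment` and `usage_context` follow the pseudo-structure above and are not existing Linked Art properties; the `_:useN` identifiers are invented blank-node labels.

```python
import itertools

PARENT = "https://linked.art/example/object/parent"
CHILD = "https://linked.art/example/object/child"
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300138075"

_use_ids = itertools.count(1)


def classify(obj_id, term, usage_context=None):
    """Return triples for one classification, scoped per use when a context is given."""
    if usage_context is None:
        # Fallback: the standard direct pattern.
        return {(obj_id, "classified_as", term)}
    use = f"_:use{next(_use_ids)}"  # one fresh node per application of the term
    return {
        (obj_id, "attribute_assignment", use),
        (use, "classified_as", term),
        (use, "usage_context", usage_context),
    }


def usage_contexts(obj_id, term, triples):
    """Contexts under which this object (and only this object) uses the term."""
    uses = {o for s, p, o in triples
            if s == obj_id and p == "attribute_assignment"}
    return {o for s, p, o in triples
            if s in uses and p == "usage_context"
            and (s, "classified_as", term) in triples}


graph = classify(PARENT, PAINTING, TYPE_OF_WORK) | classify(CHILD, PAINTING, TECHNIQUE)
```

Because no statement has the term itself as its subject, nothing accumulates on the shared node; the cost is one extra node and two extra triples per contextualized use.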

azaroth42 commented 10 months ago

To try and be clear about my opinion:

The problem is not meta-typing, the problem is at the level of the data. If the data was clean, and there were sufficient and clear vocabulary terms for all of the situations, then we wouldn't have the problem, because the metatypes would never collide. The metatypes are there to ensure that consuming systems don't need to understand the impossibly long list of all types of object, types of work, nationalities, occupations, and so on, but instead understand the metatype to know what sort of classification it is. So, in my opinion, following the same principle as solving API problems in the API, model problems in the model, we should solve data problems in the data.

I think that metatypes do follow this principle: we're solving a problem with the model/ontology (lack of clarity around type of classification) in the model/ontology (with a metatype, which just reuses the existing classification pattern, and without introducing new terms).

Similarly, per @edwardanderson's comment in #539, the same issue applies with _label. Per my response two up, it applies even more dramatically to type. Basically, whenever there is a node with a URI, internal or external, someone somewhere can make inconsistent assertions about it that might get loaded to a triplestore.

It could also be argued that it's a technology problem -- that the issue is with the use of a triplestore as the system of record. With LUX, we don't run into the problem because we don't use a triplestore as the system of record, and we don't use SPARQL as our primary interaction with the data. Any client that uses the JSON as JSON (which I think will be the vast majority of them) will not care either. So implementations using triplestores need to be aware of the problem and to work around it ... but I don't believe that should be a cost for everyone to have to pay.

On the technology side, I consider Linked Art to be an interoperability format or exchange / interchange format, not a database schema. To bastardize Postel's law: be liberal with what you accept, and strict with what you ingest into your system of record. Linked Art isn't the schema for TMS or EMu or Museum+ or any other core system of record. To ingest a Linked Art record into TMS would be a much bigger lift than a Linked Art record into a triplestore to avoid this particular issue. The same solution, some data transformation to suit the technological requirements, would be relatively simple in comparison ... just not part of the specification, but definitely useful as an implementation note.

For the object type / technique case for painting, it can be solved by cleaning up the data, as there are two separate AAT terms for painting (object type) - aat:300033618 - and painting (process) - aat:300054216. Granted that is not always the case, and sometimes there's one or the other but not both. That is a problem with the data and vocabulary layer, not with the model layer.
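As an illustration of fixing this in the data, the sketch below swaps the object-type term for the process term whenever it is metatyped as a technique. The `REMAP` table and `clean` helper are hypothetical, and the choice of aat:300138075 as the technique metatype is taken from the example earlier in the thread.

```python
PAINTING_OBJECT = "http://vocab.getty.edu/aat/300033618"
PAINTING_PROCESS = "http://vocab.getty.edu/aat/300054216"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300138075"

# (term, metatype) -> replacement term. A hypothetical, hand-maintained table.
REMAP = {(PAINTING_OBJECT, TECHNIQUE): PAINTING_PROCESS}


def clean(classification, remap=REMAP):
    """Return a copy of one classified_as entry, with its term swapped if a rule applies."""
    cleaned = dict(classification)
    for meta in classification.get("classified_as", []):
        replacement = remap.get((classification["id"], meta["id"]))
        if replacement is not None:
            cleaned["id"] = replacement
            break
    return cleaned


as_technique = clean({"id": PAINTING_OBJECT, "classified_as": [{"id": TECHNIQUE}]})
as_work_type = clean({"id": PAINTING_OBJECT, "classified_as": [{"id": TYPE_OF_WORK}]})
```

After cleaning, the two uses point at different term URIs, so their metatypes no longer collide when loaded into one graph. A real pipeline would also need to update `_label`, which this sketch leaves alone.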

e.g.

So, if we accept the above, the solution at the data layer or technology layer is "It hurts when I do this ... Don't Do That Then!" but how can we make it easier to not do the painful thing?

1. Data Documentation. When we run into the problem in real data, then document the solution. Work with Getty Vocabs to ensure that there are appropriate terms in AAT to cover the various requirements, and if not, then mint our own.
2. Vocabulary analysis and Documentation. We can spend more time with #186 to be much clearer around the expected use of terms.
3. Technology Documentation. Highlight the challenge in the documentation so implementers using triplestores are aware of the problem. We can come up with a non-interchange solution (likely per Dave around reification) that can be written up, coded up, and shared without adding complexity to the baseline case of JSON-as-JSON.
4. Validators. Create (per SHACL validation) graph based validators as well as syntax level JSON validators to detect these things.
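A graph-level check of the kind suggested in the validators item could look roughly like this. It is a plain-Python sketch, not SHACL; `conflicting_terms` is a hypothetical helper, and treating the two metatypes as mutually exclusive is an assumption of the example.

```python
from collections import defaultdict

PARENT = "https://linked.art/example/object/parent"
CHILD = "https://linked.art/example/object/child"
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUE = "http://vocab.getty.edu/aat/300138075"


def conflicting_terms(triples, exclusive):
    """Map each term to its metatypes when it carries more than one exclusive metatype."""
    found = defaultdict(set)
    for s, p, o in triples:
        if p == "classified_as" and o in exclusive:
            found[s].add(o)
    return {term: metas for term, metas in found.items() if len(metas) > 1}


graph = {
    (PARENT, "classified_as", PAINTING),
    (CHILD, "classified_as", PAINTING),
    (PAINTING, "classified_as", TYPE_OF_WORK),
    (PAINTING, "classified_as", TECHNIQUE),
}

conflicts = conflicting_terms(graph, exclusive={TYPE_OF_WORK, TECHNIQUE})
```

Running such a check at ingest time would flag "Painting" here, pointing the data owner at the term that needs cleaning up.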

azaroth42 commented 10 months ago

Copying the relevant type example from #560:

Some URIs in LUX have the same equivalent with different classes, for example:

https://lux.collections.yale.edu/view/group/de48ab54-2bb4-4334-be20-4c2912aae220 is the University of Cambridge as a Group.
https://lux.collections.yale.edu/view/place/014bc020-5b74-4cde-8a72-e7465a3bc09d is the University of Cambridge as a Place.
And both have https://www.wikidata.org/wiki/Q35794 as an equivalent, meaning that we assert:

wd:Q35794 rdf:type crm:E74_Group .
wd:Q35794 rdf:type crm:E53_Place .

Which is of course impossible in the CRM and would make a terrible mess of a triplestore.

We could discuss a URI level solution, but one which would make interoperability Very Hard at the graph level: Add structured fragments to the end of the URIs to keep them separate in the graph.

e.g. something like:

wd:Q35794#la:quaGroup rdf:type crm:E74_Group ;
  la:specializationOf wd:Q35794 .
wd:Q35794#la:quaPlace rdf:type crm:E53_Place ;
  la:specializationOf wd:Q35794 .

or for the original use case:

aat:300033618#la:quaTechnique rdf:type crm:E55_Type .
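The fragment idea can be sketched as a small minting function. The `#la:qua…` suffix and `la:specializationOf` come from the "something like" example above and are not defined terms; `qua` and `specialize` are hypothetical helper names, and prefixed names are treated as opaque strings.

```python
def qua(uri, role):
    """Mint a role-specific identifier by appending a structured fragment."""
    return f"{uri}#la:qua{role}"


def specialize(uri, role, rdf_class):
    """Triples typing the role-specific node and linking it back to the shared URI."""
    node = qua(uri, role)
    return {
        (node, "rdf:type", rdf_class),
        (node, "la:specializationOf", uri),
    }


graph = (specialize("wd:Q35794", "Group", "crm:E74_Group")
         | specialize("wd:Q35794", "Place", "crm:E53_Place"))


def classes_of(node, triples):
    """All rdf:type values asserted for a node."""
    return {o for s, p, o in triples if s == node and p == "rdf:type"}


group_classes = classes_of(qua("wd:Q35794", "Group"), graph)
shared_classes = classes_of("wd:Q35794", graph)
```

Each minted node ends up with exactly one class and the shared URI with none, but any consumer that joins on the bare URI has to know to follow `la:specializationOf`, which is the interoperability cost noted above.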

beaudet commented 10 months ago

What about using classified_as whenever there's no metatype in play and a different class carrying the metatyping and a term when there is? That would ensure the scope of metatype assertions are limited to the entity being described while enabling a more free-form handling of classified_as data. I'm imagining how an application would make use of both cases. For classified_as it might collect all terms in the result set and try to maintain the relationships between those terms if they are in the same section of a naming authority's hierarchy to infer the metatype. For example, terms appearing under a parent term labeled "types of objects" would be relatively straightforward to unwind and dispensing with metatypes is probably fine. For other situations where application of a term is more semantically nuanced, an Attribute Assignment or equivalent would carry the metatype and all the metatypes could be gathered up together to stitch the data together. That sounds feasible and not particularly complicated and it also would enable round-tripping of data.

Would there be any downside aside from having to restructure metatype assertions? I still think this would work with organizationally selected metatypes and if an organization wanted to be verbose, they could explicitly state the metatypes everywhere. The classified_as would just be limited to applying "keywords" to entities which is what those become semantically without a metatype I think.

azaroth42 commented 10 months ago

Not sure that I follow, sorry. Can you write out an example? Changing the relationships won't help -- if there's any relationship with the same URI as the subject, then it's going to collide due to the nature of triplestores not having any graph boundaries.

The triplestore-friendly structure would be something that asserted:

In the context of
