linked-art / linked.art

Development of a specification for linked data in museums, using existing ontologies and frameworks to build usable, understandable APIs
https://linked.art/

How to differentiate the use of a classification term between two objects once loaded into a graph? #461

Open beaudet opened 2 years ago

beaudet commented 2 years ago

I feel like I'm overlooking something pretty basic with this question (possibly some nuance of SHACL validation), but suppose I have the following term applied to an art object:

  "id": "http://vocab.getty.edu/aat/300033618",
  "type": "Type",
  "_label": "Painting",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300435443",
      "type": "Type",
      "_label": "Type of Work"
    }
  ]

and one of the parts of that object (or, for that matter, another object in the graph) defines another sub-classification for the same term, e.g.

  "id": "http://vocab.getty.edu/aat/300033618",
  "type": "Type",
  "_label": "Painting",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300053001",
      "type": "Type",
      "_label": "Process and Techniques"
    }
  ]

How does one distinguish these differences in a knowledge graph, which would presumably accumulate all of the classifications for the term under the term's node rather than with the object where the sub-classifications are defined?

In other words, after loading the JSON-LD, there's now a node in the graph with URI

http://vocab.getty.edu/aat/300033618

which has two sub-classifications (Type of Work + Process & Techniques) hanging off of it.

With SHACL validation, as long as ONE of the two uses of the term defines "Type of Work", no warning is issued: the sub-classification required for the object's part is already present in the graph because the parent defined it.

In general, if the IDs of terms are used directly in the graph rather than blank nodes, how does one go about differentiating between use of a term in one object rather than another?

Perhaps the most obvious way to visualize this is with the turtle output of the graph, but it's even apparent in the RDF in the sense that a duplicate sub-classification for a term isn't repeated. The location of the statements is clearer in the JSON-LD, but the nested classifications presumably cannot be round-tripped into a graph and then back out to JSON-LD. So, I guess the question is whether this is a problem or not. It certainly seems like a problem because we lose the contextual importance of the sub-classification term since it's no longer tied to the object, but to the term itself which is then shared across potentially many objects.
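The accumulation described above can be sketched without any RDF library at all, just a Python set of triples standing in for a triplestore. The `ex:parent` / `ex:child` subjects are illustrative placeholders; the point is that the metatype triple hangs off the shared AAT URI, not off either object.

```python
# Minimal sketch of how a triplestore merges statements: triples form a
# set, so the metatype assertions from two different objects both
# accumulate on the shared term URI and the per-object context is lost.
PAINTING = "http://vocab.getty.edu/aat/300033618"
TYPE_OF_WORK = "http://vocab.getty.edu/aat/300435443"
TECHNIQUES = "http://vocab.getty.edu/aat/300138075"

def triples_from_use(subject, term, metatype):
    """Triples produced by one object's classified_as with a nested metatype."""
    return {
        (subject, "crm:P2_has_type", term),
        # NOTE: the metatype is asserted about the TERM, not the object.
        (term, "crm:P2_has_type", metatype),
    }

graph = set()
graph |= triples_from_use("ex:parent", PAINTING, TYPE_OF_WORK)
graph |= triples_from_use("ex:child", PAINTING, TECHNIQUES)

# Both metatypes now sit on the shared AAT URI.
metatypes_on_term = {o for s, p, o in graph
                     if s == PAINTING and p == "crm:P2_has_type"}
```

After the merge, `metatypes_on_term` contains both "Type of Work" and "Processes and Techniques", and nothing records which object contributed which.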

beaudet commented 2 years ago

An example graph in JSON-LD with the corresponding Turtle serialization showing how the data accumulates in the graph, i.e.

<http://vocab.getty.edu/aat/300033618> rdf:type crm:E55_Type ;
    rdfs:label "Painting" ;
    crm:P2_has_type <http://vocab.getty.edu/aat/300138075> , <http://vocab.getty.edu/aat/300435443> .

SOURCE JSON-LD

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "https://linked.art/example/object/parent",
  "type": "HumanMadeObject",
  "_label": "Example Painting",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300033618",
      "type": "Type",
      "_label": "Painting",
      "classified_as": [
        {
            "id": "http://vocab.getty.edu/aat/300435443",
            "type": "Type",
            "_label": "Type of Work"
         }
      ]
    }
  ],
  "part": [
    {
      "type": "HumanMadeObject",
      "id": "https://linked.art/example/object/child",
      "_label": "Example Painting using Painting term used in the context of a technique",
      "classified_as": [
        {
          "id": "http://vocab.getty.edu/aat/300033618",
          "type": "Type",
          "_label": "Painting",
          "classified_as": [
            {
              "id": "http://vocab.getty.edu/aat/300138075",
              "type": "Type",
              "_label": "Processes and Techniques"
            }
          ]
        }
      ]
    }
  ]
}

AS TURTLE

@prefix crm:  <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<http://vocab.getty.edu/aat/300033618>
        rdf:type         crm:E55_Type ;
        rdfs:label       "Painting" ;
        crm:P2_has_type  <http://vocab.getty.edu/aat/300138075> , <http://vocab.getty.edu/aat/300435443> .

<https://linked.art/example/object/child>
        rdf:type         crm:E22_Human-Made_Object ;
        rdfs:label       "Example Painting using Painting term used in the context of a technique" ;
        crm:P2_has_type  <http://vocab.getty.edu/aat/300033618> .

<https://linked.art/example/object/parent>
        rdf:type                crm:E22_Human-Made_Object ;
        rdfs:label              "Example Painting" ;
        crm:P2_has_type         <http://vocab.getty.edu/aat/300033618> ;
        crm:P46_is_composed_of  <https://linked.art/example/object/child> .

<http://vocab.getty.edu/aat/300435443>
        rdf:type    crm:E55_Type ;
        rdfs:label  "Type of Work" .

<http://vocab.getty.edu/aat/300138075>
        rdf:type    crm:E55_Type ;
        rdfs:label  "Processes and Techniques" .

AS RDF/XML

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:crm="http://www.cidoc-crm.org/cidoc-crm/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <crm:E22_Human-Made_Object rdf:about="https://linked.art/example/object/parent">
    <rdfs:label>Example Painting</rdfs:label>
    <crm:P46_is_composed_of>
      <crm:E22_Human-Made_Object rdf:about="https://linked.art/example/object/child">
        <rdfs:label>Example Painting using Painting term used in the context of a technique</rdfs:label>
        <crm:P2_has_type>
          <crm:E55_Type rdf:about="http://vocab.getty.edu/aat/300033618">
            <rdfs:label>Painting</rdfs:label>
            <crm:P2_has_type>
              <crm:E55_Type rdf:about="http://vocab.getty.edu/aat/300138075">
                <rdfs:label>Processes and Techniques</rdfs:label>
              </crm:E55_Type>
            </crm:P2_has_type>
            <crm:P2_has_type>
              <crm:E55_Type rdf:about="http://vocab.getty.edu/aat/300435443">
                <rdfs:label>Type of Work</rdfs:label>
              </crm:E55_Type>
            </crm:P2_has_type>
          </crm:E55_Type>
        </crm:P2_has_type>
      </crm:E22_Human-Made_Object>
    </crm:P46_is_composed_of>
    <crm:P2_has_type rdf:resource="http://vocab.getty.edu/aat/300033618"/>
  </crm:E22_Human-Made_Object>
</rdf:RDF>
beaudet commented 2 years ago

Once we have member_of collections, the meta-type assertion should be used to determine in which collection the term should exist for validation, e.g. a "type of work" meta-type means look in the collection of "types of work" to see if the parent term exists there.

beaudet commented 2 years ago

related to #419

azaroth42 commented 1 year ago

Yeah, ... don't do that :) Either painting should be the object type, or the technique, but not both.

beaudet commented 10 months ago

Acknowledged, but regardless of the motivation it seems likely that a classified_as term will be metatyped multiple times, and that breaks the ability to round-trip the data in a way that reconstitutes only the metatypes applicable to the object where they are used. In other words, the context in which the metatype applies isn't really on the term itself, but on the application of that term to the Linked Art entity. I think we discussed this on the call, but it almost seems like each use of classified_as involving metatypes should mint a new identifier to capture the application of the term to the entity, so that metatypes are scoped properly. Right now it's a gray area: there can be contradictory metatypes that are valid data representations but make no logical sense, and the metatypes aggregated on each term would, when exploded back out to JSON, presumably be duplicated everywhere the term is used.

azaroth42 commented 10 months ago

The same problem exists anywhere we use vocabulary or any other sort of shared identities with a more specific ontology. You say that the museum is a Place, I say that it's a Group, and someone else is thinking of the physical building... if you aggregate all three classes associated with the same URI into a single graph, you have a huge mess ... but if the URI is ambiguous as to which of those it describes, none of us are individually wrong.

As far as I can tell, the possible solutions are worse than the symptoms:

  1. To manage our own vocabulary and/or make decisions about how existing terms must be used in the context of linked art. For the original example, we could declare that aat:300435443 must only be used with classified_as on a HumanMadeObject, and that aat:300053001 must only be used as the technique (per #561) of an activity. In aggregate that would be an enormous task as we'd be rebuilding AAT.

  2. Come up with new relationships for every potential reuse of a term, e.g. nationality, gender, type_of_work, type_of_part, style, genre, shape, etc. etc. All would need new relationships. This is why we went to metatypes in the first place.

  3. Use attribute assignments everywhere to avoid direct assertions. As with any reification approach, it adds a lot of overhead and complexity.

  4. Named Graphs. But then we're replacing one technology problem (triplestores) with another (quadstores).

Creating a new entity every time seems even worse, as there's no way to know where in the big wide world people have used and misused these URIs, and when that data might come into the triplestore. Especially as it's not just metatypes, but basically any assertion. Consider the question about _label (#539) and the type of a resource in my opening :(

We can of course discuss possible solutions, but I don't know of any that don't have huge overheads.

beaudet commented 10 months ago

I think we should talk about possible solutions to maintain the integrity of the graph to be able to round-trip it to RDF. That almost seems like a basic expectation to me given that one of the first things data consumers will probably do is load it into a graph database and run SPARQL against it. I'm not speaking from a place of much experience at all on this, but it doesn't quite feel right to me that meta-statements made in the context of a term's relationship to a single entity would (not so obviously) bleed into any entity using that term. I could understand if the use of metatypes is clearly and intentionally about making global statements about a term but I don't think that's the case.

In other words, any linked art entity can be classified with terms, but rather than a direct classification, there's a node that encapsulates two relationships. So, to me at least, it seems like (3) is the most preferable / viable solution.

HumanMadeObject -> attributeAssignment -> [ classified_as -> term URI, usage_context -> term URI ] ??

Maybe a pattern like this could be mandatory when specifying the term's usage context (aka metatype), but if there's no context to convey, it falls back on the standard pattern (if there's an elegant way to do that).

azaroth42 commented 10 months ago

To try and be clear about my opinion:

The problem is not meta-typing, the problem is at the level of the data. If the data was clean, and there were sufficient and clear vocabulary terms for all of the situations, then we wouldn't have the problem, because the metatypes would never collide. The metatypes are there to ensure that consuming systems don't need to understand the impossibly long list of all types of object, types of work, nationalities, occupations, and so on, but instead understand the metatype to know what sort of classification it is. So, in my opinion, following the same principle as solving API problems in the API, model problems in the model, we should solve data problems in the data.

I think that metatypes do follow this principle: we're solving a problem with the model/ontology (lack of clarity around type of classification) in the model/ontology (with a metatype, which just reuses the existing classification pattern, and without introducing new terms).

Similarly, per @edwardanderson's comment in #539, the same issue applies with _label. Per my response two up, it applies even more dramatically to type. Basically, whenever there is a node with a URI, internal or external, someone somewhere can make inconsistent assertions about it that might get loaded to a triplestore.

It could also be argued that it's a technology problem -- that the issue is with the use of a triplestore as the system of record. With LUX, we don't run into the problem because we don't use a triplestore as the system of record, and we don't use SPARQL as our primary interaction with the data. Any client that uses the JSON as JSON (which I think will be the vast majority of them) will not care either. So implementations using triplestores need to be aware of the problem and to work around it ... but I don't believe that should be a cost for everyone to have to pay.

On the technology side, I consider Linked Art to be an interoperability format or exchange / interchange format, not a database schema. To bastardize Postel's law: be liberal with what you accept, and strict with what you ingest into your system of record. Linked Art isn't the schema for TMS or EMu or Museum+ or any other core system of record. To ingest a Linked Art record into TMS would be a much bigger lift than a Linked Art record into a triplestore to avoid this particular issue. The same solution, some data transformation to suit the technological requirements, would be relatively simple in comparison ... just not part of the specification, but definitely useful as an implementation note.

For the object type / technique case for painting, it can be solved by cleaning up the data, as there are two separate AAT terms for painting (object type) - aat:300033618 - and painting (process) - aat:300054216. Granted that is not always the case, and sometimes there's one or the other but not both. That is a problem with the data and vocabulary layer, not with the model layer.

e.g.

[image: 461-diagram]

So, if we accept the above, the solution at the data layer or technology layer is "It hurts when I do this ... Don't Do That Then!" but how can we make it easier to not do the painful thing?

azaroth42 commented 10 months ago

Copying the relevant type example from #560:

Some URIs in LUX have the same equivalent with different classes, for example:

wd:Q35794 rdf:type crm:E74_Group .
wd:Q35794 rdf:type crm:E53_Place .

Which is of course impossible in the CRM and would make a terrible mess of a triplestore.

We could discuss a URI level solution, but one which would make interoperability Very Hard at the graph level: Add structured fragments to the end of the URIs to keep them separate in the graph.

e.g. something like:

wd:Q35794#la:quaGroup rdf:type crm:E74_Group ;
  la:specializationOf wd:Q35794 .
wd:Q35794#la:quaPlace rdf:type crm:E53_Place ;
  la:specializationOf wd:Q35794 .

or for the original use case:

aat:300033618#la:quaTechnique rdf:type crm:E55_Type .
beaudet commented 10 months ago

What about using classified_as whenever there's no metatype in play, and a different class carrying the metatyping and a term when there is? That would ensure the scope of metatype assertions is limited to the entity being described while enabling a more free-form handling of classified_as data. I'm imagining how an application would make use of both cases. For classified_as it might collect all terms in the result set and try to maintain the relationships between those terms if they are in the same section of a naming authority's hierarchy to infer the metatype. For example, terms appearing under a parent term labeled "types of objects" would be relatively straightforward to unwind, and dispensing with metatypes is probably fine there. For other situations where application of a term is more semantically nuanced, an AttributeAssignment or equivalent would carry the metatype, and all the metatypes could be gathered up together to stitch the data together. That sounds feasible and not particularly complicated, and it also would enable round-tripping of data.

Would there be any downside aside from having to restructure metatype assertions? I still think this would work with organizationally selected metatypes and if an organization wanted to be verbose, they could explicitly state the metatypes everywhere. The classified_as would just be limited to applying "keywords" to entities which is what those become semantically without a metatype I think.

azaroth42 commented 10 months ago

Not sure that I follow, sorry. Can you write out an example? Changing the relationships won't help -- if there's any relationship with the same URI as the subject, then it's going to collide due to the nature of triplestores not having any graph boundaries.

The triplestore-friendly structure would be something that asserted:

In the context of [document], the term [term URI] is a [metatype].

or:

{
  "type": "HMO",
  "local_assertions": [
    {"assigned": "aat:type-of-work", "assigned_to": "aat:painting", "assigned_property": "classified_as"}
  ]
}

And then the query would need to check the document URI, and the assigned_to Type's URI and the assigned_property, to fetch the value of assigned.
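That lookup can be sketched in plain Python over the JSON shape above. The document `id`, field names (`local_assertions`, `assigned`, `assigned_to`, `assigned_property`), and the `aat:` values are taken from the fragment above and are not part of any spec.

```python
# Sketch of querying the reified "local_assertions" shape: to recover a
# metatype you match on the document, the assigned_to term, and the
# assigned_property, then fetch the value of assigned.
doc = {
    "id": "ex:object/1",  # hypothetical document URI
    "type": "HMO",
    "local_assertions": [
        {"assigned": "aat:type-of-work",
         "assigned_to": "aat:painting",
         "assigned_property": "classified_as"},
    ],
}

def metatype_of(document, term, prop="classified_as"):
    """Return the locally scoped metatype for `term` within one document."""
    for a in document.get("local_assertions", []):
        if a.get("assigned_to") == term and a.get("assigned_property") == prop:
            return a.get("assigned")
    return None
```

Because the assertion lives inside the document, `metatype_of(doc, "aat:painting")` yields "aat:type-of-work" for this document only; another document can assert a different metatype for the same term without collision.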

beaudet commented 10 months ago

Sure! (and in full disclosure I iterated with GPT on this response).

Current Linked Art Metatyping Pattern Example

Female as Assigned Gender:

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "https://linked.art/example/person/12",
  "type": "Person",
  "_label": "Person A",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300189557", // Female
      "type": "Type",
      "_label": "Female",
      "classified_as": [
        {
          "id": "https://homosaurus.org/v3/homoit0000078",
          "type": "Type",
          "_label": "Assigned Gender"
        }
      ]
    }
  ]
}

Female as Gender Identity:

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "https://linked.art/example/person/13",
  "type": "Person",
  "_label": "Person B",
  "classified_as": [
    {
      "id": "http://vocab.getty.edu/aat/300189557", // Female
      "type": "Type",
      "_label": "Female",
      "classified_as": [
        {
          "id": "https://homosaurus.org/v3/homoit0000571",
          "type": "Type",
          "_label": "Gender Identity"
        }
      ]
    }
  ]
}

Issue in Graph Databases

When these data entries are loaded into a graph database, the context for the term "Female" becomes blurred. The term "Female" ends up associated with both the "Assigned Gender" and "Gender Identity" metatypes, but the database doesn't maintain the distinct association with each individual (Person A and Person B). This leads to a situation where the gender identity and assigned gender of both people become unclear or falsely stated. The term "Female" gets aggregated across the dataset without the individual-specific context, leading to potential misinterpretation and loss of data integrity.

This ambiguity underscores the need for a data modeling approach that maintains the specificity and context of each term's application, particularly in a graph database environment. The proposed attribute assignment method addresses this by explicitly tying each term to its intended context within the scope of each individual entity, thereby preserving the clarity and accuracy of the data.

Solution

To resolve the issue of context ambiguity in a graph database, we can use the attribute assignment method, ensuring that each use of the term "Female" is explicitly tied to its intended context for each individual. This approach maintains the specificity and clarity of the data. Let's demonstrate this with the same terms "Female," "Assigned Gender," and "Gender Identity" using attribute assignments.

Proposed Attribute Assignment Approach

Female as Assigned Gender:

Here, the term "Female" is explicitly assigned as an "Assigned Gender" for Person A.

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "https://linked.art/example/person/12",
  "type": "Person",
  "_label": "Person A",
  "assigned_by": [
    {
      "type": "AttributeAssignment",
      "assigned_property": "classified_as",
      "assigned": "http://vocab.getty.edu/aat/300189557", // Female
      "classified_as": [
        {
          "id": "https://homosaurus.org/v3/homoit0000078",
          "type": "Type",
          "_label": "Assigned Gender Classification"
        }
      ]
    }
  ]
}

Female as Gender Identity:

In this case, "Female" is assigned as a "Gender Identity" for Person B.

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "https://linked.art/example/person/13",
  "type": "Person",
  "_label": "Person B",
  "assigned_by": [
    {
      "type": "AttributeAssignment",
      "assigned_property" : "classifed_as",
      "assigned": "http://vocab.getty.edu/aat/300189557", // Female
      "classified_as": [
        {
          "id": "https://homosaurus.org/v3/homoit0000571",
          "type": "Type",
          "_label": "Gender Identity"
        }
      ]
    }
  ]
}

Solution Explanation

By using attribute assignments, we effectively encapsulate the context of each term's application within each specific entity. For Person A, "Female" is clearly defined as an assigned gender at birth, and for Person B, it's defined as their gender identity. This method prevents the ambiguity that arises in a graph database when the same term is used under different contexts across multiple entities. Each instance of "Female" is tied to a specific, unambiguous context, maintaining the integrity and clarity of the data. This approach addresses the issue of context loss in graph databases, ensuring that the specific meaning and role of terms like "Female" are preserved for each individual case.
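Why the attribute assignment avoids the collision can be shown with the same plain-triple sketch: each AttributeAssignment becomes a fresh blank node when serialized, so the shared AAT URI never appears as the subject of a triple. The CRM property names used here (P140i/P141) are assumptions about how the JSON would map to triples, not taken from the thread.

```python
# Sketch: one fresh blank node per AttributeAssignment means the two
# uses of aat:300189557 ("Female") never accumulate on the shared URI.
import itertools

FEMALE = "http://vocab.getty.edu/aat/300189557"
_bnodes = itertools.count()

def aa_triples(person, assigned, metatype):
    """Triples for one attribute assignment; the AA node is a fresh bnode."""
    aa = f"_:aa{next(_bnodes)}"
    return {
        (person, "crm:P140i_was_attributed_by", aa),  # property names assumed
        (aa, "crm:P141_assigned", assigned),
        (aa, "crm:P2_has_type", metatype),  # metatype scoped to THIS assignment
    }

graph = set()
graph |= aa_triples("ex:person/12", FEMALE, "homoit0000078")  # Assigned Gender
graph |= aa_triples("ex:person/13", FEMALE, "homoit0000571")  # Gender Identity
```

After the merge, no triple has the shared AAT URI as its subject: the term is only ever the object of P141_assigned, so nothing bleeds between Person A and Person B.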

I'm thinking for Linked Art 1.0, we shouldn't have too many prescriptive metatypes but that as time goes on and the number of metatypes used in practice grows, it would probably make sense to measure and adopt the most common ones. That will help pull the various organizational data sets into closer alignment over time.

azaroth42 commented 10 months ago

Ahh, thanks Dave! (and ChatGPT :D)

Agreed that moving the metatype classification on to the attribute assignment solves the problem for metatypes, as the AA is a blank node that's never reused, so there's no collisions with the underlying AAT term.

The current metatypes:

  • Type of Object (on HMO, DigObj)
    • Type of Part (on HMO, DO that are parts of some other HMO/DO) ... but this could be dropped in favor of always using Type of Object, as part-ness is contextual in itself
  • Type of Work (on VI, LO)
  • Style (VI/LO)
  • Shape (HMO)
  • Type of Statement (on all our referred_to_by statements everywhere)
  • Nationality (Person, Group)
  • Occupation / Role (Person, Group)
  • Gender (Person)

The same technique (reification) would also solve the other collisions (e.g. the class for 'pencil' in object isA Type(pencil) vs object made_of Material(pencil)) but at great usability expense, especially for the extremely common case of human readable statements.

{
  "referred_to_by": [
    {
      "type": "LinguisticObject",
      "content": "Oil on Canvas",
      "attributed_by": [
        {
          "type": "AttributeAssignment",
          "assigned_property": "classified_as",
          "assigned": [
            {
              "id": "http://vocab.getty.edu/aat/materials",
              "type": "Type",
              "_label": "Material Statement"
            }
          ],
          "classified_as": [
            {
              "id": "http://vocab.getty.edu/aat/300418049",
              "type": "Type",
              "_label": "Type of Statement"
            }
          ]
        }
      ]
    }
  ]
}

Given the short list of metatypes and their ubiquity in data, I think I'd be happier with minting our own specializations of P2_has_type

{
  "referred_to_by": [
    {
      "type": "LinguisticObject",
      "content": "Oil on Canvas",
      "type_of_statement": {
        "id": "http://vocab.getty.edu/aat/materials",
        "type": "Type",
        "_label": "Material Statement"
      }
    }
  ]
}

and then documenting how to get from la:type_of_statement to the attribute assignment pattern for graph based processing using only CRM properties.
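That documented expansion could be sketched as a simple JSON rewrite: devolve the (hypothetical) la:type_of_statement shortcut back into the AttributeAssignment pattern before loading into a graph store. The field names mirror the two JSON fragments above; the shortcut property itself is a proposal, not part of the current model.

```python
# Sketch: rewrite the proposed `type_of_statement` shortcut into the
# AttributeAssignment pattern shown earlier in the thread.
def expand_type_of_statement(stmt):
    """Return a copy of `stmt` with the shortcut devolved into an AA."""
    stmt = dict(stmt)  # shallow copy; leave the input untouched
    metatype = stmt.pop("type_of_statement", None)
    if metatype is not None:
        stmt["attributed_by"] = [{
            "type": "AttributeAssignment",
            "assigned_property": "classified_as",
            "assigned": [metatype],
            "classified_as": [{
                "id": "http://vocab.getty.edu/aat/300418049",
                "type": "Type",
                "_label": "Type of Statement",
            }],
        }]
    return stmt

short = {"type": "LinguisticObject", "content": "Oil on Canvas",
         "type_of_statement": {"id": "http://vocab.getty.edu/aat/materials",
                               "type": "Type", "_label": "Material Statement"}}
expanded = expand_type_of_statement(short)
```

This keeps the compact form for JSON consumers while giving triplestore implementers a mechanical path to the collision-free reified form.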

beaudet commented 10 months ago

Sure, that works, although I thought one disadvantage of the mint new relationships approach is:

Come up with new relationships for every potential reuse of a term, e.g. nationality, gender, type_of_work, type_of_part, style, genre, shape, etc., etc. all would need new relationships. This is why we went to metatypes in the first place.

but if the number is relatively small, I think that's the clearest approach.

Should we also recommend avoiding metatyping? From a practical perspective that seems more likely to create annoyances / inconveniences down the road. For example, imagine creating an AI agent that is trained on the use of SPARQL and the Linked Art model and knows how to formulate specific SPARQL queries in response to a natural-language search. It seems much more feasible to do that with SPARQL in a no-code way as opposed to juggling JSON.

azaroth42 commented 10 months ago

My preference is still the status quo (metatypes) and document how to avoid the issue when managing a graph, as it's not just metatypes that have the problem, we also see real world cases of type and _label collisions.

The underlying cause is making assertions (type, _label, and classified_as) about other institutions' URIs and then aggregating those assertions from different sources. If you're inconsistent internally, well, that's your problem :) But even if you're consistent internally about others' data, when your data is merged with someone else who is differently internally consistent, the problem returns.

So, to return to the possible solutions as I see them.

  • Reification. Attribute Assignments everywhere, per your comment. (Doesn't solve the label or type issue, and very verbose)
  • New Properties. Devolve the AA into a property. (ditto, but trades verbosity for a non-interoperable approach outside Linked Art)
  • Named Graphs. Replaces one technology problem with another, and probably doesn't solve everything anyway.
  • New URIs. We could manage our own URI space when collisions are detected, or beg Patricia for new URIs in AAT. Doesn't solve non AAT problems.
  • Modified URIs. Per my comment a few up, we could have a convention of adding fragments to external URIs to align them with the usage to avoid collisions. This solves the problem, but would be very costly to apply across the board, and breaks any hope of inter-graph connectedness. A slight variation on New URIs to have them minted by convention everywhere, rather than by authority in one place.
  • Status Quo. Document the challenge, document the possible solutions, and have adopters pick their internal flavor separately from external interoperability flavor.

Other than the status quo, my second choice would be new properties as they're easy to use and understand in most implementations, and use cases that require interoperability with non Linked Art profile CRM data are going to have to do some transformation work regardless.

azaroth42 commented 10 months ago

To try to catalog the places we use external URIs as the subject of triples:

  • References to anything

    • equivalent
    • about
    • represents
    • influenced_by
  • References to AAT / Concepts

    • classified_as
    • technique
    • broader
    • made_of (Material)
    • unit (MeasurementUnit)
    • language (Language)
    • currency (Currency)
  • References to TGN / Places

    • part_of
    • took_place_at
    • residence
    • current_location
  • References to ULAN / People or Groups

    • carried_out_by
    • participated_in
    • member_of (Group)
    • current_owner
    • transferred_custody|title_to|from
  • Misc References

    • access_point can point to anything (and asserts that it's a digital object)
    • conforms_to

For the references to specific types, we could solve the type and _label problems by only using the URI, e.g. "carried_out_by": ["ulan:rembrandt", "https://local.data/person/12345"]. But then we lose most of the benefit of _label as a way to associate some name with the otherwise opaque URI.

For the non-specific typed references, equivalent is always the same class as the thing which has the equivalent property, but about, represents and influenced_by can point to anything. It seems like having access to the class should help implementations know how to process the link, and might even be necessary for it. Similarly for Person/Group it could be valuable to know the class. Given the API should be consistent (e.g. if it's an array of strings, then it's an array of only strings, not sometimes also json objects), I think this would be throwing out far too much value.

azaroth42 commented 10 months ago

Okay ... how about this ...

We decided to allow equivalent in a reference in #439, and I wonder if we can double down on that to solve this issue?

The problem is that we don't want to make conflicting assertions about other institutions' URIs. e.g. that aat:pencil is sometimes a material, and sometimes an object type. Or that female is sometimes a gender expression, and sometimes a biological trait.

We could go further than #439 (which allowed adding equivalent) and say that the reference might have only equivalent and not also id. Let's say it's the entirely vanilla case of metatypes on statements. Instead of saying that aat:material-statement is classified_as aat:brief-text, we could say that there's a blank node (or a locally identified node with a URI) that has an equivalent of aat:material-statement and which is classified_as aat:brief-text.

So, instead of

"referred_to_by": [
  {
    "type": "LinguisticObject",
    "content": "Oil on Canvas",
    "classified_as": [
      {
        "id": "aat:material-statement",
        "type": "Type",
        "classified_as": [
          {
            "id": "aat:brief-text",
            "type": "Type"
          }
        ]
      }
    ]
  }
]

It would be:

"referred_to_by": [
  {
    "type": "LinguisticObject",
    "content": "Oil on Canvas",
    "classified_as": [
      {
        "type": "Type",
        "equivalent": [
          {
            "id": "aat:material-statement",
            "type" : "Type"
          }
        ],
        "classified_as": [
          {
            "id": "aat:brief-text",
            "type": "Type"
          }
        ]
      }
    ]
  }
]

Which doesn't conflict with anything anyone else says, and doesn't add annoying reification. The cost is one more level of join in a query (you'd have to search for all the bnodes with an equivalent, rather than the id directly), but that can be mitigated by assigning a local URI for material-statement and using that instead.

This is a combination of the reification and "new URI" solutions -- that instead of a new shared URI, we don't force the minting of new URIs and instead rely on blank nodes being the external entity in context (reification) with their own (internal, non-dereferencable) identity ... but then aligned with the external URI as the object of a triple, rather than the subject.
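The indirection above can be sketched as a small JSON rewrite: wrap the external URI in a local/blank node that is merely equivalent to it, so the external URI only ever appears as the object of a triple, never the subject. Function and parameter names here are illustrative, not spec.

```python
# Sketch of the "equivalent" indirection from the comment above: the
# classification sits on a local (blank) node, and the external URI is
# only referenced via `equivalent`.
def wrap_reference(external_id, node_type="Type", metatypes=()):
    """Replace a direct external reference with a bnode carrying equivalent."""
    node = {
        "type": node_type,
        "equivalent": [{"id": external_id, "type": node_type}],
    }
    if metatypes:
        node["classified_as"] = [{"id": m, "type": "Type"} for m in metatypes]
    return node

ref = wrap_reference("aat:material-statement", metatypes=["aat:brief-text"])
# No "id" on the wrapper: it serializes as a blank node, so the
# classified_as assertion never attaches to aat:material-statement itself.
```

The query-side cost is the extra join mentioned above: consumers must look for nodes whose equivalent points at the external URI, rather than matching the URI directly.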

beaudet commented 10 months ago

Bravo, there's no longer a round-tripping problem nor an "authorship" conflict, since external IDs are not modified.

Furthermore, rather than asserting that material statements are brief texts, a claim that could easily be disputed in a global context, you're saying there's a new blank-node Type, equivalent to the AAT material statement, that's classified as a brief text. So it's a surrogate for the AAT term. But why not just classify the Type as a combination of two other types, as in the example below? Is that less semantically rich? In your version it's a material statement in the category of brief text; in the version below it's a Type that's an amalgamation of the two. Either way the Type is a blank node, but with equivalent you can swap in the AAT term when you need to resolve it, whereas the amalgamation leaves no substitute. Yeah, ok, your version probably makes more sense than the one below.

This is a somewhat obtuse question, but hypothetically: "should" the linguistic object come up in a search for brief texts when modeled via one approach versus the other?

  "referred_to_by": {
    "type": "LinguisticObject",
    "content": "This artwork features a unique material composition.",
    "classified_as": [
      {
        "type": "Type",
        "_label": "a brief material statement as text",
        "classified_as": [
          {
            "type": "Type",
            "id": "http://vocab.getty.edu/aat/material-statement"
          },
          {
            "type": "Type",
            "id": "http://vocab.getty.edu/aat/brief-text"
          }
        ]
      },
      ... more types, some simple, some meta
    ]
  }
beaudet commented 10 months ago

(GPT'd for clarity) For tomorrow's meeting agenda concerning the creation of mandatory Linked Art URIs for vocabulary, I recommend proceeding with minting them. The benefits appear significant, and the overall quantity may not be overly burdensome. Utilizing GitHub as a platform for managing this process could be efficient. It would enable us to handle updates through pull requests, facilitating near real-time additions to the vocabulary without needing to frequently update the specifications. This approach also offers an advantage for graphs currently using blank nodes; they could reference a list of equivalent usages to identify and integrate new concepts into the Linked Art vocabulary.

Regarding properties, using an AttributeAssignment as an intermediary node seems a more natural fit: with 'Type' as the type, we can assign a Type URI to the value and clarify its nature using a 'classified_as' statement. 'Equivalent' appears to function as an alternative way of integrating this within 'classified_as'. That might be suitable, given that the 'categories' or 'Types' of an entity differ from its properties, but it's important to consider the delineation between categories and properties, especially in the context of metatypes. Both categories and properties can carry classifications via metatypes, which raises the question: where should we draw the line between categories and properties, and should a category have an applied metatype? Without clear differentiation, categories could become confusing quickly, especially for entities like 'Person', where there can be many. It would help me to have clear guidelines on when to use 'classified_as' versus 'AttributeAssignment', assuming this isn't already outlined in our documentation.

bluebinary commented 10 months ago

There has been a lot of good discussion on this topic, but I think we definitely need to discuss the above in much more detail before we can collectively come to an agreement on the best way forward.

It would help if all of the truly viable solutions were documented with JSON-LD examples so that all of us (of varying levels of familiarity and skill with JSON-LD, semantics, and the Linked.Art model) could easily compare the different modeling approaches and better understand the implications of the different solutions (some of the approaches noted above do helpfully have JSON examples to go along with them, but not all).

If we were to adopt many of the above solutions (especially those least like the status quo), this could represent one of the biggest changes to the modeling of much of Linked.Art from its inception to 1.0. For those of us who have already adopted Linked.Art into production and have been making use of it for several years, it would mean a significant amount of effort to adopt, all to solve a relatively small issue that I think could be dealt with more readily and effectively downstream of the JSON-LD, with better documentation and better tooling.

I think the status quo represents the cleanest and clearest of the above modeling approaches, and if we could collectively better document the issue and include workable downstream solutions for the graph store issue, and (potentially) provide some open source tooling to assist with JSON-LD to triple conversion (i.e. graph expansion) such tooling could remedy the conflicting assertions before they make it into the graph store to begin with.

After all, the issue I believe we are really trying to resolve here is the loss of context from context-specific assertions once those assertions are placed into most graph stores. The various solutions detailed above that further contextualize and uniquely identify such assertions certainly help, as they prevent global assertions from being made. But they clearly cannot be used everywhere, and thus do not entirely solve the issue; and the majority of the suggested solutions seem to be compromises that ultimately reduce the clarity of the JSON-LD for the benefit of the graph store, which is just one of many consumers of the data. As such, is it helpful for all of us to have to bear the burden of the technology deficits of current triple store implementations, or can we find a better way to do this?

As most implementers and consumers of Linked.Art are likely to continue interacting with the data primarily as JSON-as-JSON, it does feel like most of the recommendations are imposing significant additional complexity for a relatively small benefit; I certainly want to be sure that we don't make global assertions that don't universally hold to be true, but at the same time most of the solutions seem overly burdensome on the JSON-LD generation side of things, and particularly on the JSON-LD consumption side – both for the primary JSON-as-JSON use case by greatly increasing the complexity of code needed to parse the JSON-LD, and for the graph store case by making the required SPARQL queries that much more complex to write and that much more inefficient to process.

cbutcosk commented 10 months ago

I see the need for some solution here. @beaudet's gender example reads as a pretty definitive demonstration of the issues with the status quo re: graph round-tripping of metatypes. Note this is also a problem in pure JSON de/serializers if they use a kv-based cache layer (a common feature of out-of-band taxonomy enrichers).

That said in most of the cases where users would want to multiply-metatype aat:female I would advise them as @azaroth42 did upthread--that's a new term with its own identity, so it is probably clearer and more manageable to use a new identifier. In the case of domains with inequitable term distribution like gender this seems especially desirable: don't you want other people to use that great new term?

Using a blank node is a choice to hide a node's identity semantics from the user. That's fine in cases like Dave's where he wants to partition those terms without making a URI for whatever reason (easy to imagine that at the low-end of implementations too, where infra complexity might be too much to mint a URI). But (AFAIU) JSON-LD specifies how to identify blank nodes but does not specify an algorithm for blank node generation and deduplication so--buyer beware.

Anyway long way of saying since I can't make the meeting today: +1 to minting term identifiers, this seems more equitable than leaving users and implementers at the mercy of the GVP and Wikidata anyway. +1 to @bluebinary 's point that more examples and scrutinizing would be good, though if more capacious terms are emitted that seems less crucial.

beaudet commented 10 months ago

Great discussion on the call today. I keep thinking this is primarily a data problem. What are the disadvantages of an approach where L.A. version 1.0 allows for both the status-quo metatyping pattern and the equivalence proposal in the API but that we label the pre-1.0 metatyping pattern as deprecated? That would mean a translation layer is required only by systems interacting with the pre-1.0 pattern in whole-graph context.

Are there challenges using the equivalence pattern that we should consider, either using the JSON-LD directly or in a triple store or is it relatively straightforward to make use of that?

That would enable institutions that have already invested considerable time and expense to run a spec-compliant system without modification and let them kick the can down the road until the deprecated pattern is removed from the spec in a later version.

azaroth42 commented 10 months ago

To try and summarize the pros and cons from the call:

  • It is a real challenge that has been encountered in the wild by multiple organizations, in different situations

  • Adopters will expect to be able to load LOD into a triplestore and not generate garbage by following the specification -- a reputation and usability hit, as they also likely won't read the documentation

  • Documenting the differences for SPARQL queries between the model and the local implementation is additional work that can be minimized if the model doesn't introduce necessary local modifications

  • The challenge is at the intersection of technology and data -- the model works just fine if there are unique URIs for different concepts (etc) -- and should thus be solved at one of those two levels

  • The use of equivalent is good, and the notion of it being a proxy is a pattern that's used in other specifications

  • Mandating equivalent rather than id is seen as a step too far, as using AAT (etc) is what people expect to be able to do

  • The transformation (move id to equivalent on load) can be documented and is a small number of lines of code that doesn't need access to more than a single JSON document at a time. Most systems will need some degree of code to load data, even if that's only expanding the JSON-LD into triples. It is easy to document, the code can be provided, and it could even be made available as a web service.
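
That load-time transformation might look something like the following Python sketch (a hypothetical helper, not official Linked Art tooling): it rewrites any classified_as reference that both carries an id and is itself classified (i.e. a metatype), moving the id into equivalent so that loading the document makes no assertions about the external URI.

```python
import json

def id_to_equivalent(node):
    """Recursively rewrite metatyped classification references: a
    classified_as entry that both has an `id` and is itself classified
    (a metatype) gets its `id` moved into `equivalent`, so loading the
    document makes no assertions about the external URI."""
    if isinstance(node, list):
        for item in node:
            id_to_equivalent(item)
    elif isinstance(node, dict):
        for ref in node.get("classified_as", []):
            if isinstance(ref, dict) and "id" in ref and "classified_as" in ref:
                ref["equivalent"] = [{"id": ref.pop("id"),
                                      "type": ref.get("type", "Type")}]
        for value in node.values():
            id_to_equivalent(value)
    return node

doc = {
    "type": "LinguisticObject",
    "content": "Oil on Canvas",
    "classified_as": [
        {"id": "aat:material-statement", "type": "Type",
         "classified_as": [{"id": "aat:brief-text", "type": "Type"}]}
    ],
}
id_to_equivalent(doc)
print(json.dumps(doc, indent=2))
```

The metatype itself (aat:brief-text here) is left untouched, since it is not further classified; only the metatyped reference becomes a blank node with an equivalent.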

Outstanding question: should equivalent with no id be legal in the API? (e.g. this blank node is a proxy for the entities given in equivalent)

azaroth42 commented 10 months ago

IMO, it shouldn't be legal for the following reasons:

  • Freedom from choice -- there should be one way to do things, as much as possible.
  • Cost is high if we allow it, as every processor will need to check both id and equivalent/id for everything. Any path (in json or sparql) that goes to or through a reference would need to process both. That's a lot of joins.
  • Solve at the right layer -- this is promoting a fix through the model and up to the API from the technology layer.
  • Makes determining a reference vs an embedded pattern harder, as references wouldn't necessarily have id any more. This isn't a show stopper, as equivalent would also imply the reference.

So I think MUST have an id and MAY also have equivalent is the right way to go. Worst case scenario, an implementation could skolemize all those blank nodes and be compliant (at the expense of query performance, as above). It encourages implementors to sort out their data and identify the things they actually want to talk about. We can also pursue AAT and other commonly used vocabularies and request more discrete identities for things ... and if we can't get them, we can always mint our own.
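
As a sketch of what that skolemization could look like (the base URI and UUID scheme are illustrative choices, not anything the spec prescribes):

```python
import uuid

def skolemize(node, base="https://example.org/.well-known/genid/"):
    """Give a minted URI to any embedded reference that has an
    `equivalent` but no `id` (i.e. would expand to a blank node).
    The base URI and uuid scheme are illustrative choices only."""
    if isinstance(node, list):
        for item in node:
            skolemize(item, base)
    elif isinstance(node, dict):
        if "equivalent" in node and "id" not in node:
            node["id"] = base + str(uuid.uuid4())
        for value in node.values():
            skolemize(value, base)
    return node

ref = {"type": "Type",
       "equivalent": [{"id": "aat:material-statement", "type": "Type"}],
       "classified_as": [{"id": "aat:brief-text", "type": "Type"}]}
skolemize(ref)
print(ref["id"].startswith("https://example.org/.well-known/genid/"))  # True
```

This yields a compliant document (every reference has an id), at the cost of the extra equivalent join at query time noted above.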

To Dave's point about deprecation: We're pre 1.0 still, so have the freedom to change /everything/. To mark something as deprecated in 1.0 would be very weird and we should instead get off the fence and make a decision. Instead, I would say that if it proves to be a real challenge when we have more implementations, then we can deprecate in a 1.1, and remove in 2.0.

beaudet commented 10 months ago

Agree on it being silly to deprecate in 1.0, but if we're going to allow metatyping with id, doesn't that make equivalent redundant? Wouldn't it be just as easy to convert metatypes to blank nodes + equivalents on ingest?

Is there a best practice to suggest to maximize value and minimize the risk of metatyping external ids?

For example, use CAUTION with metatypes:

  • you SHOULD NOT further classify Types that have external IDs
  • you SHOULD instead first look on the Linked Art site for common pre-minted "metatypes", such as the enhanced material statement classification, to avoid having to use a metatype directly in the first place
  • barring that, you SHOULD mint your own classification terms and apply an appropriate equivalent statement in addition to the id of the term; that will best prepare you for a potential future change in the event that metatyping with ids is deprecated
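
A minimal checker for the first of these SHOULD rules might look like this (a hypothetical sketch, not official validation; "external" here simply means a reference with an id and no equivalent):

```python
def check_metatypes(node, path="$"):
    """Collect warnings wherever a Type reference with an external `id`
    is itself further classified (a metatype applied to someone else's
    URI). 'External' here means: has an `id` and no `equivalent`."""
    warnings = []
    if isinstance(node, list):
        for i, item in enumerate(node):
            warnings += check_metatypes(item, f"{path}[{i}]")
    elif isinstance(node, dict):
        if (node.get("type") == "Type" and "id" in node
                and "classified_as" in node and "equivalent" not in node):
            warnings.append(f"{path}: metatype applied to external id {node['id']}")
        for key, value in node.items():
            warnings += check_metatypes(value, f"{path}.{key}")
    return warnings

doc = {"classified_as": [{"id": "aat:material-statement", "type": "Type",
                          "classified_as": [{"id": "aat:brief-text",
                                             "type": "Type"}]}]}
for w in check_metatypes(doc):
    print(w)
```
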
beaudet commented 9 months ago

As it so happens, I'm generating Concepts for our public data set right now and I have a candidate representation to review.

The TMS data model for concepts has a term type and a cross-reference type associated with terms applied to art entities. I'm not sure yet exactly how relevant the term types are to a linked data expression. The values for Term Type are as follows; to me, at least, they suggest that this assignment could indicate an additional metatype to apply to the term, but generally this is a detail that can probably be skipped for now.

count    mnemonic  description
1007436  TERM      Descriptor
192162   ALT       Alternate Term
51701    UF        Use For Term
24518    MISC      Miscellaneous Source Equivalent
22804    STE       Search Term Equivalent
17306    LINK      LCSH Link
8778     RT        Related Term
1451     POSTCO    Post Coordinated Terms
1031     UK        British Equivalent
641      UKALT     Alternate British
231      ISO       3-Letter ISO Code
51       POSTAL    Postal Code

There are also cross reference types as follows which, to me, appear to follow the pattern for metatypes much more clearly than the term types listed above. In fact, some of these metatypes suggest that the term is actually better positioned in other parts of the graph such as place of birth, etc. I think it's probably ok to list all of these under classified_as as well as pulling out specific metatypes for populating other parts of the graph.

count   cross_reference_type
166092  School
165580  Keyword
73712   Theme
70548   Media
69887   Technique
55822   Support
47367   Scope
37536   Color
32841   Candidate
24795   Gender
19977   IAD Object Name
18238   IAD Microfiche Catalogue Image
18069   Descriptor
17696   IAD Locale
17617   IAD Class
15435   Loan Object Type
12563   Place Executed
11417   Style
9840    Photography Format
9568    Birthplace
9051    Active Place
9012    Systematic Catalogue Volume
8096    Exhibition History Citation
7847    Collection Place
7845    Publication Place
6813    Death Place
6500    Collection
5983    Production Location
3711    Group Name
2371    Housing Needs
2227    Shape
1543    Venue Status
771     Book Format
223     Catalogue Citation
194     (Not Assigned)
178     Object Type
12      Place Depicted (TGN)
1       Place Depicted

As far as implementation goes, in order to avoid making statements about the Getty AAT terms directly in our data, I've minted a URI for each metatype as its own concept. In the example below, the NGA's "Active Place" term has the ID listed below. It is equivalent to the Getty AAT "place types" term and is further classified with "active (professional function)" to suggest that it's a place where professional activity takes place, which seems perfectly reasonable to me. Since a new ID is minted for each local metatype, and the equivalent and classified_as only appear in the Concept entity rather than in other entities that make use of this term, there is never a danger of making statements about Getty terms.

{
    "id": "https://id.nga.gov/895c8031-4529-4349-863c-14645e308bf5",
    "type": "Type",
    "_label": "Active Place",
    "identified_by": [
        {
            "type": "Name",
            "content": "Active Place",
            "classified_as": [
                {
                    "id": "aat:primaryName_term",
                    "type": "Type",
                    "_label": "primary name"
                }
            ]
        }
    ],
    "classified_as": [
        {
            "id": "https://vocab.getty.edu/aat/300393177",
            "type": "Type",
            "_label": "active (professional function)"
        }
    ],
    "equivalent": [
        {
            "id": "https://vocab.getty.edu/aat/300435109",
            "type": "Type",
            "_label": "place types"
        }
    ]
}

Does this look ok to everyone?