lcnetdev / bibframe-ontology

Repository for versions of BIBFRAME ontology.
http://www.loc.gov/bibframe/

Ontology is unusable without domains and ranges #121

Open · azaroth42 opened 1 month ago

azaroth42 commented 1 month ago

Rather than making snarky comments on the myriad individual issues, I'll open a high-level issue for the problems that the recent changes are causing.

By reducing every domain and range to rdfs:Resource, you have destroyed any usability or interoperability of the ontology to the point where it's completely worthless. Why? Because rdfs:Literal is a subclass of rdfs:Resource.

So properties like agent, carrier, language, place, and so on can each point to either an entity of any class or a literal value of any type. A language whose value is a date? No problem! A place that's a Concept ... sure, why not!
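
To make this concrete, here's a minimal Turtle sketch (the ex: names are made up; the rdfs:Resource range is the pattern at issue):

```turtle
@prefix bf:   <http://id.loc.gov/ontologies/bibframe/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# Declared range, per the current ontology: anything at all.
bf:language rdfs:range rdfs:Resource .

# Both of these are therefore equally "conformant",
# because rdfs:Literal is a subclass of rdfs:Resource.
ex:work1 bf:language ex:someLanguage .          # an IRI of any class
ex:work2 bf:language "2024-01-01"^^xsd:date .   # a date literal
```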

At this point, all the ontology actually provides is a flat list of names of properties and classes that implementers can choose from and mix and match freely to their hearts' content. Two implementations that follow the ontology yet take diametrically opposed approaches to almost any modeling choice are both completely valid, and interoperability is thereby gone. This makes usability by software engineers either wonderful (no constraints, so everything is correct) or impossible (no constraints, so everything has to be tested for, which is infeasible across arbitrary data).

timathom commented 1 month ago

@azaroth42, thanks for shining a light on this issue—you are spot on! I'm reminded of something I heard Nicola Guarino say: "Interoperability is not compatible with underspecification. […] A well-founded computational ontology is a specific artifact expressing the intended meaning of a vocabulary in a machine-readable form." As an ontology, BIBFRAME is underspecified, and that is bound to be a barrier to adoption.

Your point about rdfs:Resource is not only an argument for interoperability: it also exposes a specific bug in the ontology.

To Reproduce

  1. Load the bibframe.rdf file into Protégé.
  2. Examine the Object properties and Data properties tabs. Properties such as subject (and the 30 other object properties that have rdfs:Resource as their declared range) appear in both tabs. The rdfs:Resource range appears only with the datatype property version.
  3. Save the ontology from Protégé to disk.
  4. Reload the ontology into Protégé. After loading, 31 warnings are logged, of the form:

```
Illegal redeclarations of entities: reuse of entity http://id.loc.gov/ontologies/bibframe/subject in punning not allowed [Declaration(ObjectProperty(http://id.loc.gov/ontologies/bibframe/subject)), Declaration(DataProperty(http://id.loc.gov/ontologies/bibframe/subject))]
```

The declaration of rdfs:Resource as the range of an owl:ObjectProperty forces the property to be punned as an owl:DatatypeProperty (because rdfs:Literal is a subclass of rdfs:Resource), and this violates the OWL 2 rules for punning:

> OWL 1 DL required a strict separation between the names of, e.g., classes and individuals. OWL 2 DL relaxes this separation somewhat to allow different uses of the same term, e.g., Eagle, to be used for both a class, the class of all Eagles, and an individual, the individual representing the species Eagle belonging to the (meta)class of all plant and animal species. However, OWL 2 DL still imposes certain restrictions: it requires that a name cannot be used for both a class and a datatype and that a name can only be used for one kind of property.
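
In Turtle, the round-tripped declarations boil down to something like this (a sketch of the offending pattern, not Protégé's exact serialization):

```turtle
@prefix bf:   <http://id.loc.gov/ontologies/bibframe/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# One IRI ends up declared as two different kinds of property,
# which the OWL 2 DL punning rules quoted above forbid.
bf:subject a owl:ObjectProperty , owl:DatatypeProperty ;
    rdfs:range rdfs:Resource .
```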

Fixing the bug requires removing the rdfs:Resource range from all object properties, which would at least address the extreme case of allowing a language to have a date as its value.
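
Sketched in Turtle, the minimal fix is just dropping the range triple (shown here for bf:subject):

```turtle
@prefix bf:   <http://id.loc.gov/ontologies/bibframe/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Before: a range that admits literals as well as IRIs.
# bf:subject a owl:ObjectProperty ;
#     rdfs:range rdfs:Resource .

# After: no vacuous range triple; the property is unambiguously
# an object property.
bf:subject a owl:ObjectProperty .
```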

The broader issue of underspecification is still one that needs to be addressed—although data validation is a separate issue from conceptual modeling, and some of the interoperability issues are bound to be solved through community consensus, application profiles, SHACL shapes, etc.
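
To give one flavor of that route: an application profile could close the "language = date" loophole with a SHACL shape along these lines (a hypothetical sketch; the shape IRI and targeting choices are illustrative only):

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix ex: <http://example.org/shapes/> .

# Constrain bf:language values to IRIs typed as bf:Language,
# at validation time, without touching the ontology's semantics.
ex:WorkShape a sh:NodeShape ;
    sh:targetClass bf:Work ;
    sh:property [
        sh:path bf:language ;
        sh:nodeKind sh:IRI ;
        sh:class bf:Language ;
    ] .
```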

Finally, I also can't help myself, as you say, Rob 😄

I was reminded of your 2015 report, Analysis of the BIBFRAME Ontology for Linked Data Best Practices, namely, section 2.4.5, "Only Define What Matters":

> It is tempting to specify as much as possible about new terms in an ontology and lock it down with a constricting range and domain. This should be avoided unless the relationship is only ever useful with the particular class of resource, as it prevents others from reusing the terms when the general semantics are appropriate, but the exact use is not the same context as the reason for its creation. It also leads to difficult-to-trace inferences as the range and domain of properties in RDF are not validation constraints, instead they provide additional information about the resource. If a property has the domain of a Person, and it is used with a Cat, then this is not an error. Instead processing systems will incorrectly infer that the Cat is also simultaneously a Person. This is specified in the description of the RDF Schema ontology for both rdfs:domain and rdfs:range. Further, this applies within an institution and ontology. Defining predicates that can be used in multiple situations based on the semantics of the relationship, not the features of the subject or object, make an ontology significantly simpler and easier to maintain, thereby increasing adoption.
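
To make the inference behavior described there concrete, here is a minimal sketch (all ex: names hypothetical):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# A domain declaration is an inference rule, not a validation constraint.
ex:hasEmployer rdfs:domain ex:Person .

ex:Felix a ex:Cat ;
    ex:hasEmployer ex:Acme .

# An RDFS reasoner raises no error here; it simply entails:
#   ex:Felix a ex:Person .
```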

Of course, that was nearly 10 years ago, but the line of thinking expressed there was influential at the time and, ironically, paved the way for the modeling decisions that you now rightly critique!

azaroth42 commented 1 month ago

Thanks Tim :) My 2015 understanding of good practice and the 2024 understanding are definitely different on that particular topic, and the differences (I think) come from the renewed emphasis in the ontology world on foundational models, rather than domain-specific ones. I'm surprised you still have a copy of the report!

The challenge that 2.4.5 (and 2.4.1) was attempting to address was the proliferation of predicates. We can see the end result of going down this line in the 1100 predicates of RDA. We see it rearing its ugly head already in the proliferation of organization-based classes such as ShelfMarkNlm and the OCLC class proposed in #120. BF 2.x is much, much better than 1.0 in these regards, but we can still improve further. I still agree with the final sentence quoted above, but the way to get there is broader semantics with a deeper conceptual class hierarchy.

To take #19 as the example... the root cause is an incomplete definition (and thereby an incomplete understanding) of "location". Saying that "online" can be a Place expands the notion of location from the purely physical/spatial into the digital. Or potentially into the abstract, to say that the location is "in storage" (a state, or at best a classification of the intended use of a Place). Rather than working out how to improve the model, the existing relationships have been broadened far beyond their initial intent. We don't want "placeGeographic", "placeObject", "placeConceptual", "placeDigital" -- that would fall back into the same trap we got out of in 2.0, where every predicate had the name of its range class embedded in it. Instead, there should be a digital-thing class that can have a locator in digital space. Physical objects can be at a location in geographic space, or related to some other object (the letter is in the folder, the folder is on the shelf).

Conversely, it doesn't make sense to say that Concepts or Works have a location, so these would need to be in a different branch of the class hierarchy. Other interesting cases would be the beginning of existence of a thing and partitioning of things.
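
A rough sketch of the kind of branch structure I mean, with every name hypothetical:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/model/> .

# Specialize by the kind of thing being located,
# not by embedding the range class in the predicate name.
ex:DigitalObject  rdfs:subClassOf ex:Thing .
ex:PhysicalObject rdfs:subClassOf ex:Thing .

# Digital things have locators in digital space...
ex:locator rdfs:domain ex:DigitalObject ;
    rdfs:range ex:DigitalLocation .

# ...physical things sit in geographic space or inside other objects...
ex:location rdfs:domain ex:PhysicalObject ;
    rdfs:range ex:Place .
ex:containedIn rdfs:domain ex:PhysicalObject ;
    rdfs:range ex:PhysicalObject .

# ...and Concepts or Works, in a different branch, get neither property.
```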

2.4.1 (reuse vocabulary terms) is where I think my understanding has changed the most significantly over the past decade with experience of Open Annotation, BF, IIIF, CIDOC CRM, RICO and Linked Art. Grafting predicates between conceptual models risks unintentionally importing undesirable semantics. Better to have a single core conceptual model with a strong ontology that allows sufficient scope for reuse across domains.

jimfhahn commented 1 month ago

The Guarino quote is interesting; it called to mind the work Guarino & Welty did on 'ontology cleaning' back in the day (circa 2000). Guarino & Welty used symbolic modal logic to suggest a backbone architecture of meta-properties for ontology development. A nice exercise published in 2007 applied one part (roles) of the ontology-cleaning method to FRBR (https://experts.illinois.edu/en/publications/three-of-the-four-frbr-group-1-entity-types-are-roles-not-types), but the authors, in a stroke of deep generosity to the profession, suggested that

> ..., roles are well suited for representing change, but where change is logically impossible, physically impossible, or even just highly unlikely, the advantage of converting types to roles in a conceptual model, as opposed to a general ontology, may perhaps be slight and add unnecessary complexities to cataloguing practice and system design. A conceptual model such as FRBR, which might be called a denormalized ontology, reflects this reality. For the problems of bibliographic control it would seem that the actual world is perhaps world enough.

Indeed. For the problems of bibliographic control, it would seem that the actual world is perhaps world enough.

To advance generously, then, we might view BIBFRAME as a denormalized ontology. I like to think of BIBFRAME as bringing forth a rich bibliographic cataloging tradition into linked data. Svenonius wrote in the preface to The Intellectual Foundation of Information Organization, on the aims of the book, that

> much of the literature that pertains to the intellectual foundations of information organization is inaccessible to those who have not devoted considerable time to study the disciplines of cataloging, classification, and indexing. It uses a technical language, it mires what is of theoretical interest in a bog of detailed rules and it is widely scattered in diverse sources such as thesaurus guidelines, codes of cataloging rules, introductions to classification schedules, monographic treatises, periodical articles and conference proceedings.

I often look to my colleagues for help in making sense of a beautifully pragmatic cataloging practice, and I count as colleagues those who have guided the changes we see in BIBFRAME today. I support their work; pragmatically and empirically, BIBFRAME is used in disparate systems and interoperates just fine. I've presented a little on useful approaches. They aren't the only way to interoperate, but, like the actual world, they are enough.

timathom commented 1 month ago

> Thanks Tim :) My 2015 understanding of good practice and the 2024 understanding are definitely different on that particular topic, and the differences (I think) come from the renewed emphasis in the ontology world on foundational models, rather than domain-specific ones. I'm surprised you still have a copy of the report!

@azaroth42, yes, speaking of digital space... :) it looks as though I recovered a copy from Google Drive in 2018!

> I still agree with the final sentence quoted above, but the way to get there is broader semantics with a deeper conceptual class hierarchy.

> To take #19 as the example... the root cause is an incomplete definition (and thereby an incomplete understanding) of "location". Saying that "online" can be a Place expands the notion of location from the purely physical/spatial into the digital.

Great examples. In addition to space, BIBFRAME also lacks a theory or model of time. It's all practice (porting MARC 21) and no theory. The lack of attention to definitions (compared with more robust models) is another case in point. In BIBFRAME, time is reduced to a one-dimensional date datatype property, presumably on the assumption that EDTF literals will be used to express ranges and approximate dates, but the flattening effect really limits expressivity.
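
For instance (the bf: triples below follow the published pattern as I understand it; the ex: alternative is purely hypothetical, loosely in the spirit of CIDOC CRM time-spans):

```turtle
@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix ex: <http://example.org/> .

# As published: time collapses into a single EDTF literal.
ex:instance1 bf:provisionActivity [
    a bf:Publication ;
    bf:date "1956/1957"^^<http://id.loc.gov/datatypes/edtf>
] .

# A more expressive (hypothetical) alternative: the time span is a
# first-class node with its own properties, open to further refinement.
ex:instance1 ex:publicationEvent [
    ex:timeSpan [
        ex:beginOfTheBegin "1956-01-01" ;
        ex:endOfTheEnd     "1957-12-31"
    ]
] .
```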

> 2.4.1 (reuse vocabulary terms) is where I think my understanding has changed the most significantly over the past decade with experience of Open Annotation, BF, IIIF, CIDOC CRM, RICO and Linked Art. Grafting predicates between conceptual models risks unintentionally importing undesirable semantics. Better to have a single core conceptual model with a strong ontology that allows sufficient scope for reuse across domains.

Right, alignment with upper-level ontologies or conceptual models can do more to advance interoperability. The LRMoo approach looks appealing, though I need to study the model more closely.

timathom commented 1 month ago

> A nice exercise published in 2007 applied one part (roles) of the ontology-cleaning method to FRBR (https://experts.illinois.edu/en/publications/three-of-the-four-frbr-group-1-entity-types-are-roles-not-types)

Thanks, @jimfhahn, for the references--I'll take a look! I do appreciate where you're coming from, although you may be romanticizing cataloging practice a bit :) I speak from experience, having worked in the MARC 21 milieu for a few years. The problem, to me, is that BIBFRAME seems to model the world of cataloging practice rather than the actual world itself. Within the cataloging community, shared rules and norms are probably enough to drive interoperability through consensus. But if we are at all interested in interoperability more broadly, we should think about how to model our data in a way that's coherent and well defined. Projects like openWEMI are also interesting in this regard.