information-artifact-ontology / ontology-metadata

OBO Metadata Ontology
Creative Commons Zero v1.0 Universal

Record the fact that a statement is somehow auto-generated? #172

Open gouttegd opened 5 months ago

gouttegd commented 5 months ago

Several ontologies contain annotations that have not been manually curated/edited, but are instead the result of some kind of automatic generation process.

For example, FlyBase’s Drosophila anatomy ontology (FBbt) contains classes whose text definition has been automatically generated from the logical definition of the class (by “translating” the class expression the class is equivalent to into plain English).

We can also expect to see more annotations that are the result of some LLM-assisted process.

I think it would be useful if this kind of auto-generated content could be explicitly flagged as such, for at least two reasons:

1) Basic honesty. There is an implicit assumption that an ontology is the result of the work of human curators who know what they are doing. Users have the right to know when a part of an ontology is instead the result of an automated process involving no actual (human) curation.

2) Provide a way for LLM folks to avoid using auto-generated content when they collect training data, to avoid a situation where the next generation of LLM is trained on the output of the previous generation (it could be that this horse has already left the barn; still, doesn’t mean we shouldn’t try to avoid making things worse).

In the aforementioned FBbt ontology, automatically generated definitions are annotated with an oboInOwl:hasDbXref annotation with the special value FBC:Autogenerated (where FBC stands for “FlyBase Curator”). It’s better than nothing but it’s obviously a local, ad-hoc solution. A standard, uniform way to flag auto-generated statements would be better.
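To make the FBbt convention concrete, here is a minimal sketch of how a consumer could find definitions flagged with FBC:Autogenerated in an RDF/XML serialisation. The sample fragment is made up for illustration (the class IRIs and definition texts are hypothetical), and the sketch assumes the flag is carried by a reified owl:Axiom block, which is how axiom annotations are usually serialised; it is not FlyBase's own tooling.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OIO = "http://www.geneontology.org/formats/oboInOwl#"
DEFINITION = "http://purl.obolibrary.org/obo/IAO_0000115"

# Hypothetical fragment: class IRIs and definition texts are invented.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:owl="{OWL}" xmlns:oboInOwl="{OIO}">
  <owl:Axiom>
    <owl:annotatedSource rdf:resource="http://example.org/FBbt_A"/>
    <owl:annotatedProperty rdf:resource="{DEFINITION}"/>
    <owl:annotatedTarget>Any neuron that has its soma in some example region.</owl:annotatedTarget>
    <oboInOwl:hasDbXref>FBC:Autogenerated</oboInOwl:hasDbXref>
  </owl:Axiom>
  <owl:Axiom>
    <owl:annotatedSource rdf:resource="http://example.org/FBbt_B"/>
    <owl:annotatedProperty rdf:resource="{DEFINITION}"/>
    <owl:annotatedTarget>A manually written definition.</owl:annotatedTarget>
    <oboInOwl:hasDbXref>FBC:DOS</oboInOwl:hasDbXref>
  </owl:Axiom>
</rdf:RDF>"""


def autogenerated_definitions(rdfxml: str) -> list:
    """Return the IRIs of classes whose text definition carries FBC:Autogenerated."""
    root = ET.fromstring(rdfxml)
    found = []
    for axiom in root.iter(f"{{{OWL}}}Axiom"):
        prop = axiom.find(f"{{{OWL}}}annotatedProperty")
        # Only look at annotated text definitions (IAO:0000115).
        if prop is None or prop.get(f"{{{RDF}}}resource") != DEFINITION:
            continue
        xrefs = [x.text for x in axiom.findall(f"{{{OIO}}}hasDbXref")]
        source = axiom.find(f"{{{OWL}}}annotatedSource")
        if "FBC:Autogenerated" in xrefs and source is not None:
            found.append(source.get(f"{{{RDF}}}resource"))
    return found


print(autogenerated_definitions(SAMPLE))
```

The filter only works because the consumer already knows the magic string, which is exactly the limitation of a local convention.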

Several possibilities:

a) A simple annotation with a new property that takes a boolean value (something like OMO:is_autogenerated=true) and merely indicates whether the statement the annotation is applied to is, well, auto-generated.

b) An annotation with a new property that takes either a string or (preferably) an IRI, and that indicates both: 1) the fact that the statement is auto-generated and 2) some information about how the content was generated, for example with an IRI that identifies the generating process (something like OMO:generator=http://example.org/my/text/definition/generator).

c) Defining some special values to use with existing properties such as dc:contributor or dc:source (@matentzn ’s idea; something like dc:contributor=openai:gpt4). Does not involve any new property but implies that one must look at the value of the annotation to possibly know that the content is auto-generated.
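A rough sketch of what each option means for a consumer who only wants to answer "is this statement auto-generated?". Everything here is hypothetical: the property names are the placeholders used above (OMO:is_autogenerated, OMO:generator), the `Statement` class is an illustrative stand-in for an annotated axiom, and `KNOWN_AUTOMATED_AGENTS` is an invented registry that option (c) would require.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Illustrative stand-in for an axiom plus its axiom annotations."""
    subject: str
    predicate: str
    value: str
    annotations: dict = field(default_factory=dict)


# Option (a): dedicated boolean flag.
option_a = Statement("EX:1", "IAO:0000115", "Some definition.",
                     {"OMO:is_autogenerated": "true"})
# Option (b): dedicated property whose value identifies the generator.
option_b = Statement("EX:2", "IAO:0000115", "Some definition.",
                     {"OMO:generator": "http://example.org/my/text/definition/generator"})
# Option (c): special value on an existing property.
option_c = Statement("EX:3", "IAO:0000115", "Some definition.",
                     {"dc:contributor": "openai:gpt4"})
# A manually curated statement (placeholder ORCID).
manual = Statement("EX:4", "IAO:0000115", "Some definition.",
                   {"dc:contributor": "orcid:0000-0000-0000-0000"})

# Hypothetical registry: with (c), the consumer must know which values
# denote automated agents.
KNOWN_AUTOMATED_AGENTS = {"openai:gpt4"}


def is_autogenerated(stmt: Statement) -> bool:
    ann = stmt.annotations
    if ann.get("OMO:is_autogenerated") == "true":   # (a): read the flag
        return True
    if "OMO:generator" in ann:                      # (b): presence is enough
        return True
    # (c): the property alone says nothing; the value must be inspected.
    return ann.get("dc:contributor") in KNOWN_AUTOMATED_AGENTS


for s in (option_a, option_b, option_c, manual):
    print(s.subject, is_autogenerated(s))
```

With (a) and (b) the presence of the property answers the question; with (c) the answer depends on a list of values that someone has to maintain.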

Thoughts?

cmungall commented 5 months ago

It would be great to have more metadata here.

Another use case is axioms added by a reasoner. The robot reason command adds an is_inferred annotation. This is better than nothing, but doesn't answer questions like:

We also previously discussed having complete PROV graphs with very clear provenance:

This may seem like overkill, and there are the usual objections about not having individuals cluttering ontologies, but I think it is worth doing this right, with a full data model.

There is sometimes a blurry line between annotations (in the bio sense) and ontology axioms. I personally follow Rector et al and believe there needs to be a firm dividing line here and we should not capture annotations in OWL. But it can be convenient, and the horse may have left the barn:

There are of course existing data models for annotations, such as the GO evidence model, and biolink.

It's really important to be precise about sources of axioms, whether auto-generated or not, and this will be increasingly important.

But as a stopgap measure until we have full PROV graphs, what's wrong with having hasDbXref axiom annotations, and having standard conventions for the object (e.g. the LLM pipeline used)? Using dbxref axiom annotations is already a standard that has been in use for 20 years and is well understood by tools.

cmungall commented 5 months ago

I realize dc:contributor=openai:gpt4 is just an example but I would recommend always linking to the specific pipeline or tool used (ideally with metadata, e.g. sources used for RAG). Hopefully people are not using ChatGPT directly and are instead doing this through RAG pipelines or tools like Consensus.

matentzn commented 5 months ago

Very nice issue, I love it and I think this is very important. I would love the PROV-solution in the ticket you shared.

  1. Adopting prov:wasGeneratedBy universally in all ontologies for everything. I love the ideas behind PROV even if they may look unwieldy at first glance
  2. Introduce PURL subdomains for ontologies that describe prov:Activity instances for that ontology, e.g. http://purl.obolibrary.org/obo/mondo/generation/robot1287896386. For non-OBO activities, any PURL to an activity description would do. The activities provide detailed descriptions of the process that generated the axioms, like its sources, tools and their versions, etc.
  3. Have each axiom in the ontology link to one of these.

I know this sounds impractical, but it is also beautiful. The main downside is that our ontologies get cluttered with a lot of provenance information, but we could perhaps agree on a scheme that does not include the serialised PROV graph in the ontologies but have the activity purls resolve to it.

Dreaming. I really believe in this because I think that provenance will be the main selling point for declarative forms of knowledge in the age of AI.

gouttegd commented 5 months ago

There is sometimes a blurry line between annotations (in the bio sense) and ontology axioms. I personally follow Rector et al and believe there needs to be a firm dividing line here and we should not capture annotations in OWL.

I am not sure I understand what you mean by “annotations (in the bio sense)”. And likely because of that, I don’t understand what this has to do with the issue at hand.

But as a stopgap measure until we have full PROV graphs, what's wrong with having hasDbXref axiom annotations, and having standard conventions for the object

The fact that precisely, we do not have standard conventions for the object. The FBC:Autogenerated used by FlyBase is nothing more than a local convention that has no value outside of FlyBase. It would not make sense to generalise it to other ontologies. The FBC pseudo-prefix explicitly stands for FlyBase Curator; it stems from the practice, dating from before ORCID was a thing, of attributing statements to individual curators identified by their initials (e.g. FBC:DOS for a statement attributed to David Osumi-Sutherland).

[About using PROV] I know this sounds impractical, but it is also beautiful

I agree about the beautiful part. I’d be happy with such a solution, except for one bit. It does not provide a simple, direct way to get the information that a statement had been automatically generated without human input (which is what I want to do here).

Unless I missed something (I’ll admit I only briefly skimmed the PROV spec for now), prov:wasGeneratedBy does not necessarily imply that the generation process was an automatic process. A prov:Activity can refer to any kind of “activity”: it can be a manual curation/editing activity or a fully automatic, reasoner- or LLM-driven activity.

So if all axioms in the ontology are annotated with prov:wasGeneratedBy annotations (each pointing to the activity that produced it), when someone wants to know if a given statement/axiom has been automatically generated (which, again, is what I want to allow here), they will need to either:

*) have out-of-band knowledge of which activities correspond to automatic processes (e.g. a list of all prov:Activity IDs known to represent such processes; it’s hard to see how such knowledge could be compiled and even harder to see how it could be kept up-to-date);

*) explore the prov:Activity associated with the axiom (which implies that it should either be provided with the ontology, or we should have a way to know where to get it) and hope that the model describes the activity sufficiently enough that we can distinguish between manual activities and automatic activities.
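The second route can be sketched as follows. This is only an illustration under strong assumptions: the `ACTIVITIES` lookup table stands in for "resolving the activity PURL", the `agent_type` key is a made-up field, and the sketch assumes activity descriptions record whether the responsible agent is a prov:SoftwareAgent or a prov:Person (both are real PROV classes, but nothing guarantees a given description uses them).

```python
from typing import Optional

# Hypothetical activity descriptions, keyed by activity IRI. In practice
# these would have to be shipped with the ontology or fetched from the PURL.
ACTIVITIES = {
    "http://purl.obolibrary.org/obo/mondo/generation/robot1287896386": {
        "agent_type": "prov:SoftwareAgent",
    },
    "http://example.org/activity/manual-edit-1": {
        "agent_type": "prov:Person",
    },
}


def is_automatic(axiom_annotations: dict, activities: dict) -> Optional[bool]:
    """Return True/False when the activity can be classified, else None."""
    activity_iri = axiom_annotations.get("prov:wasGeneratedBy")
    if activity_iri is None:
        return None  # no provenance recorded on the axiom
    description = activities.get(activity_iri)
    if description is None:
        return None  # activity description not available to the consumer
    agent_type = description.get("agent_type")
    if agent_type is None:
        return None  # description too coarse to tell manual from automatic
    return agent_type == "prov:SoftwareAgent"


robot_axiom = {"prov:wasGeneratedBy":
               "http://purl.obolibrary.org/obo/mondo/generation/robot1287896386"}
manual_axiom = {"prov:wasGeneratedBy": "http://example.org/activity/manual-edit-1"}
unknown_axiom = {"prov:wasGeneratedBy": "http://example.org/activity/unresolvable"}

print(is_automatic(robot_axiom, ACTIVITIES))
print(is_automatic(manual_axiom, ACTIVITIES))
print(is_automatic(unknown_axiom, ACTIVITIES))
```

The three `None` branches are the point: each one is a place where the indirect route can fail to yield the single bit of information being asked for.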

Ultimately it should be doable, but I can’t help thinking we are envisioning a complex solution that may provide a lot of potentially useful information but will fail to provide easily the one bit of information that, for now, we know we want to provide.

gouttegd commented 5 months ago

hope that the model describes the activity sufficiently enough that we can distinguish between manual activities and automatic activities.

I like the idea of something like the ROBOPROV ontology outlined in the document linked in https://github.com/ontodev/robot/issues/6, and I think such an ontology would be absolutely necessary if we want to be able to describe our “activities” in enough detail (PROV on its own seems largely insufficient), but I think it should not be focused specifically on ROBOT. If we have to design such an ontology, we should make it cover not only ROBOT but also the OAK, and possibly the ODK as well.

gouttegd commented 5 months ago

So if all axioms in the ontology are annotated with prov:wasGeneratedBy annotations

Wait, prov:wasGeneratedBy is an object property. You can’t annotate an axiom with an object property, can you?

matentzn commented 5 months ago

I can only say this: https://github.com/information-artifact-ontology/ontology-metadata/issues/90

The semantic web was designed for instance level assertions, not class level assertions..

gouttegd commented 5 months ago

I don't see any way forward other than what you say: re-type.

But doesn’t that imply that any ontology in which we would use a prov:wasGeneratedBy re-typed as an annotation property can never be merged (e.g. imported into) an ontology that happens to use prov:wasGeneratedBy with its original object property type?

cmungall commented 5 months ago

I am not sure I understand what you mean by “annotations (in the bio sense)”

https://incatools.github.io/ontology-access-kit/glossary.html#term-Annotation https://incatools.github.io/ontology-access-kit/guide/associations.html#associations

cmungall commented 5 months ago

The semantic web was designed for instance level assertions, not class level assertions..

I'd say the semantic web RDF world is completely fine with class level assertions; classes are instances of classes in RDFS. It was OWL1 that insisted that classes aren't in the domain of discourse, creating the awful hack of "annotation properties" to get around this. They backtracked a bit with OWL2 but punning is fundamentally strange and confusing to 99% of people, same with OWL-Full.

matentzn commented 5 months ago

The decision that AP/OP punning is illegal is really one of the most annoying design decisions. In Protege you can't even select an OP when annotating an axiom.

But doesn’t that imply that any ontology in which we would use a prov:wasGeneratedBy re-typed as an annotation property can never be merged (e.g. imported into) an ontology that happens to use prov:wasGeneratedBy with its original object property type?

@gouttegd yes, this is the super annoying downside. They can be merged on RDF level, but not on OWL level (e.g. imported using OWL API-based tooling, or processed using tools sensitive to the, as Chris says, often annoying limitations of OWL-Full).

The alternative is to never re-use any already existing properties, which I believe is worse. The "typing" is just a design flaw, and since RDF level integration is totally fine either way, I would vote for re-typing.
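A small sketch of why the merge is fine at RDF level but problematic at OWL level. Triples are modelled as plain Python tuples with prefixed names, which is a simplification for illustration; the annotation target IRI is the example PURL from earlier in the thread, and the blank-node label is made up.

```python
RDF_TYPE = "rdf:type"
OBJECT_PROPERTY = "owl:ObjectProperty"
ANNOTATION_PROPERTY = "owl:AnnotationProperty"

# PROV declares the property as an object property.
prov_graph = {
    ("prov:wasGeneratedBy", RDF_TYPE, OBJECT_PROPERTY),
}

# Our ontology re-types the same IRI so it can be used on axioms.
our_graph = {
    ("prov:wasGeneratedBy", RDF_TYPE, ANNOTATION_PROPERTY),
    ("_:axiom1", "prov:wasGeneratedBy",
     "http://purl.obolibrary.org/obo/mondo/generation/robot1287896386"),
}


def merge(*graphs):
    """RDF-level merge: a plain set union, which always succeeds."""
    return set().union(*graphs)


def conflicting_property_types(graph):
    """Return IRIs declared as both an object and an annotation property."""
    declared = {}
    for s, p, o in graph:
        if p == RDF_TYPE and o in (OBJECT_PROPERTY, ANNOTATION_PROPERTY):
            declared.setdefault(s, set()).add(o)
    return sorted(iri for iri, types in declared.items() if len(types) > 1)


merged = merge(prov_graph, our_graph)
print(len(merged))
print(conflicting_property_types(merged))
```

The union goes through without complaint; it is only when an OWL-aware tool interprets the two type declarations together that the conflict surfaces.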

gouttegd commented 5 months ago

Crazy idea (not sure I would vote for it myself, but just thinking out loud):

Instead of re-typing (that is, use the IRI of a standard OP as if it was an AP), how about defining new IRIs (in a dedicated namespace) for annotation properties that “mirror” the object properties we would like to use, with explicit mappings between each new AP and its original OP counterpart?

That is, if we’d like to use, for example, prov:wasGeneratedBy on an axiom (where we’d need an AP), we create a xxx:wasGeneratedBy annotation property that is explicitly mapped (e.g. with a skos:exactMatch) to prov:wasGeneratedBy.
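As a sketch of what this would ask of consumers: anyone integrating at RDF level would have to rewrite the mirror property back to the original before querying across sources. The `xxx:` prefix is the placeholder used above, and the mapping table is a hypothetical in-memory stand-in for the skos:exactMatch assertions.

```python
# Hypothetical mapping, equivalent to:
#   xxx:wasGeneratedBy skos:exactMatch prov:wasGeneratedBy
MIRROR_TO_ORIGINAL = {
    "xxx:wasGeneratedBy": "prov:wasGeneratedBy",
}


def normalize(triples, mapping):
    """Rewrite mirror annotation properties to their original counterparts."""
    return {(s, mapping.get(p, p), o) for s, p, o in triples}


# One source uses the mirror AP, another uses the original property.
ontology_triples = {
    ("_:axiom1", "xxx:wasGeneratedBy", "http://example.org/activity/1"),
}
other_triples = {
    ("ex:dataset", "prov:wasGeneratedBy", "http://example.org/activity/2"),
}

combined = normalize(ontology_triples | other_triples, MIRROR_TO_ORIGINAL)
generated_by = sorted(s for s, p, _ in combined if p == "prov:wasGeneratedBy")
print(generated_by)
```

Without the `normalize` step, a query on prov:wasGeneratedBy would silently miss every statement recorded with the mirror property.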

That’s akin to “never re-use any already existing (object) properties”, yes. But at least the new properties would not come out of the blue and would not be reinventing the wheel – they would follow the existing properties.

Again, not sure I believe myself this is a good idea, but WDYT?

matentzn commented 5 months ago

Instead of re-typing (that is, use the IRI of a standard OP as if it was an AP), how about defining new IRIs (in a dedicated namespace) for annotation properties that “mirror” the object properties we would like to use, with explicit mappings between each new AP and its original OP counterpart?

This is not a crazy idea, it's neat, but the main problem remains: it creates a burden for users who want to integrate at RDF level.

Previously we (in this case I will take responsibility) weighed the tradeoff between (1) re-typing, with its OWL violations, and (2) using different IRIs, which requires mappings and causes churn on the data integration side, and decided in favour of (1) as the lesser of two evils.

Check this:

<rdf:Description rdf:about="#exactMatch">
    <rdfs:label xml:lang="en">has exact match</rdfs:label>
    <rdfs:isDefinedBy rdf:resource="http://www.w3.org/2004/02/skos/core"/>
    <skos:definition xml:lang="en">skos:exactMatch is used to link two concepts, indicating a high degree of confidence that the concepts can be used interchangeably across a wide range of information retrieval applications. skos:exactMatch is a transitive property, and is a sub-property of skos:closeMatch.</skos:definition>
    <!-- S38 -->
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#ObjectProperty"/>
    <!-- S42 -->
    <rdfs:subPropertyOf rdf:resource="#closeMatch"/>
    <!-- S44 -->
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#SymmetricProperty"/>
    <!-- S45 -->
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#TransitiveProperty"/>
    <!-- S46 (not formally stated) -->
    <rdfs:comment xml:lang="en">skos:exactMatch is disjoint with each of the properties skos:broadMatch and skos:relatedMatch.</rdfs:comment>
    <!-- For non-OWL aware applications -->
    <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
  </rdf:Description>

in here: https://www.w3.org/TR/skos-reference/#namespace-documents

For the convenience of tools and applications that wish to work within the constraints of OWL DL, the SKOS RDF Schema - OWL 1 DL Sub-set [SKOS-RDF-OWL1-DL] provides a modified, informative, schema which conforms to those constraints. Note that this schema is obtained through the deletion of triples representing axioms that violate OWL DL constraints. Alternative prunings could be performed.

It will be massive churn to create parallel hierarchies in the way you propose for all vocabularies we re-use where the originators, for some reason, thought it was a good idea to model the metadata properties in OWL.

I personally think we should not create different IRIs. I value the fact that we can integrate at RDF level higher than maintaining conceptual separation.