cmungall opened this issue 2 years ago
Cool! Sounds interesting! No comment at the moment, but I think this looks useful.
Gilda is explicitly not an NER tool: it only does named entity normalization, meaning you already have a piece of text that represents a named entity, and it figures out a grounding for it. Unfortunately, this is a very common misconception. I am a bit confused about what you mean by this issue, since I think we have a different understanding of some of the vocabulary used here.
Update: we actually implemented NER in Gilda v1.0, which was released on June 30th, 2023.
Ah, that was just my misconception about Gilda. The goal here is to represent the full CR (concept recognition) step: NER plus grounding/normalization.
For these 3 reasons, I would not develop such an interest.
OA seems much more general
Practical use case: the use of a standard format for genome coordinates (GFF) has allowed lots of different datasets and browsers (e.g. JBrowse) to be combined. It would be nice to have something similar for text annotations, so that we could use the same markup (e.g. the nice spacy markup) with different annotators. It would also be nice if this were a modern JSON-based serialization, or a well-behaved TSV with a defined datamodel and a schema that can be used for validation, optionally with datamodel elements mapped to IRIs.
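To make the idea concrete, here is a minimal sketch of what one record in such a JSON-based serialization might look like. All field names are hypothetical illustrations, not a proposed standard:

```python
import json

# Hypothetical record for one text annotation; field names are
# illustrative only, loosely mirroring span-based formats like GFF.
annotation = {
    "document_id": "PMID:12345",       # assumed identifier scheme
    "subject_start": 10,               # span start offset in the text
    "subject_end": 17,
    "subject_text": "cerebro",
    "object_id": "UBERON:0000955",     # grounded ontology term (brain)
    "object_label": "brain",
    "annotator": "example-annotator",  # tool that produced the match
}

# A JSON-based serialization round-trips cleanly and is easy to validate
# against a schema:
serialized = json.dumps(annotation)
assert json.loads(serialized) == annotation
```

The point is less the exact fields than that the datamodel is explicit, so the same markup can be emitted by different annotators and validated uniformly.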
and to be clear: this is out of scope for SSSOM. I am exploring interest in an analog
I like the PubAnnotator JSON format for storing annotations, but it doesn't have a very rich way of representing what we know about a particular annotation. Something that connects it with SSSOM or with the Web Annotation Data Model would be nice, I think! I don't know how popular PubAnnotator JSON is, or how its community is organized, so it might be worth reaching out to them first to see if they'd be interested in something like this.
Our use case: we used SciGraph to annotate text, stored the result in one track in PubAnnotator, and then normalized those identifiers using the Translator Node Normalizer, which we stored in another track. Having some sort of layered labeling -- "the original text is 'cerebro', NER says that it is MESH:1234 'brain', NodeNorm says that this is UBERON:2345 'brain'" -- could be useful here.
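That layered labeling could be sketched roughly like this, with one entry per processing step over the same span (the track names and field names here are hypothetical, and the IDs are the placeholder ones from the example above):

```python
# Sketch of "layered labeling": the same matched span carries one entry
# per processing step, so the provenance of a grounding stays visible.
layers = [
    {"track": "ner",      "span": "cerebro", "id": "MESH:1234",   "label": "brain"},
    {"track": "nodenorm", "span": "cerebro", "id": "UBERON:2345", "label": "brain"},
]

# Downstream code can then trace how a grounding was produced:
for layer in layers:
    print(f"{layer['track']}: '{layer['span']}' -> {layer['id']} ({layer['label']})")
```

Each track is independently useful, and the chain from surface text to final normalized ID is never thrown away.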
While waiting on this, we went ahead and defined a "profile" of SSSOM for text annotator results:
https://w3id.org/oak/text-annotator
This is actively used in OAK
From @matentzn : https://mapping-commons.github.io/sssom/LiteralMapping/
There exist several models for annotations, e.g. Lemon and Open Annotation; see two state-of-the-art surveys in the Semantic Web journal: https://www.semantic-web-journal.net/system/files/swj1909.pdf and https://www.semantic-web-journal.net/system/files/swj2859.pdf. The main point is to link the process description to the output, in order to be able to compare the results. I would rather try to find a way to merge or translate between those existing models than create a new one.
@croussey thank you for reaching out!
I looked at https://www.w3.org/2019/09/lexicog/, for example, and https://www.w3.org/TR/annotation-model/, as you suggest. They, and others mentioned by the surveys you shared, seem like great resources.
My personal position (not speaking for anyone else) is that what we need to provide here is something much simpler than any of these models are designed for: a way to share SSSOM mappings where the subject is a literal. It is taking so much effort now to organise all these trainings, tutorials, presentations, etc. that it would be strange to say: use SSSOM for this one use case (entity mappings), but for this other, tiny-little different use case (entity mappings where the entity is a string), use this entirely different system.
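A rough sketch of what such a "mapping where the subject is a literal" row could look like in an SSSOM-style TSV. The column names below are approximations for illustration, not a verbatim copy of the SSSOM LiteralMapping schema:

```python
import csv
import io

# Approximate sketch of an SSSOM-style row whose subject is a literal
# string rather than an entity. Column names are illustrative only.
fieldnames = [
    "literal", "predicate_id", "object_id",
    "object_label", "mapping_justification",
]
row = {
    "literal": "cerebro",
    "predicate_id": "skos:exactMatch",
    "object_id": "UBERON:0000955",
    "object_label": "brain",
    "mapping_justification": "semapv:LexicalMatching",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

This reuses the familiar SSSOM predicate and justification vocabulary, which is exactly why treating it as a profile rather than a separate system is attractive.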
All SSSOM presentations start the same way: standardisation is a decentralised effort, and at all times a number of competing standards emerge. This is not bad. It is so much easier to convert one standardised format into another than to convert non-standardised data into a standardised form. We could write adapters for any of the systems you are linking to, and I am happy to accept a PR that explains when and why to prefer each of these over SSSOM - I really don't care whether people use one or the other, as long as they standardise their data and publish it according to Open FAIR Data principles.
What do you think? I am fine to be contradicted of course!
I share Clement's point of view. Annotation is not the same as mapping between entities. A text (full document) can be interpreted in different ways, and any annotator may have a specific point of view that justifies the annotation. Annotation is a process serving different purposes, so there are no right and wrong annotations; it depends on the use case. That is why I like the W3C OA model: in annotation, the main point is to describe the process, in order to understand why the annotation exists. An entity limited to a string is not a good entity because it is not interpretable; we need context to understand the entity (definition, or graph neighborhood, ...). Concerning mapping between entities, in my view, because entities should be described in a semantic resource that provides the context, we could say that there are right and wrong mappings.
Another point: I do not find the existing annotation models complex; the OA model is not complex. Thus I would not say that SSSOM is simpler than the others... as usual, it depends on how easily you can interpret the model. We use SSSOM for some entity mappings, and we had to reinterpret the model for our purposes in some examples. The paper will be available soon: https://www.frontiersin.org/articles/10.3389/frai.2023.1188036/abstract
Super exciting work! Cool!
An entity limited to a string is not a good entity because it is not interpretable; we need context to understand the entity (definition, or graph neighborhood, ...)
This is of course correct. A literal in this sense is not perceived as an entity, but more like a synonym in the thesaurus sense. And this (associating synonyms to entities) does not at all cover the OA scope - I think for a full model as Chris suggests up top, we indeed have to look at OA! But this is, as you say yourself, totally out of scope for SSSOM.
TBH, technically speaking you are probably right, and I feel bad arguing against you - there is a degree of "laziness" in the decision to add the SSSOM literal profile, or, to phrase it more positively, a lack of resources to familiarise ourselves with and integrate OA or something similar into our toolings, trainings, etc. I stand firm on the assertion, though, that we can integrate these after the fact.
Would you be willing to write a paragraph for the SSSOM docs pointing people that seek to publish "literal mappings" to look at OA first, and explain how they can distinguish broad/narrow/exact using it?
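For reference, a literal mapping expressed as a Web Annotation might look something like the sketch below, using the spec's TextQuoteSelector to anchor the matched string. The spec does not prescribe how to express the broad/narrow/exact distinction; attaching it as extra metadata on the body, as done here, is one assumed approach among several:

```python
import json

# One possible OA (Web Annotation Data Model) encoding of a literal
# mapping. TextQuoteSelector and the "identifying" purpose come from
# the spec; carrying the match strength on the body is an assumption.
anno = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "SpecificResource",
        "source": "http://purl.obolibrary.org/obo/UBERON_0000955",
        "purpose": "identifying",
    },
    "target": {
        "source": "http://example.org/doc1",  # hypothetical document IRI
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "cerebro",
        },
    },
}

# The annotation is plain JSON-LD and serializes directly:
assert json.loads(json.dumps(anno))["type"] == "Annotation"
```

A documentation paragraph could walk through exactly this kind of example and contrast it with the SSSOM literal profile, so readers can pick the right level of expressiveness.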
With pleasure. We will have another article from the French D2KAB project that will provide 3 examples of OA uses in an agronomical context. Could you let me know where to add the paragraph about OA documentation? Maybe I could write it in the git issue and let you copy-paste it where you want...
@croussey Please feel free to rework this file in the documentation to help people picking the right standard: https://github.com/mapping-commons/sssom/blob/master/src/docs/sssom-profiles.md
There are a number of different tools that perform NER on text, from BioPortal/Zooma through to scispacy and @cthoyt's Gilda (https://www.biorxiv.org/content/10.1101/2021.09.10.459803v1.full).
These all vary in their output but are some variant of text span location and ID plus metadata for the matched concept.
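That common denominator could be captured in a small shared datamodel; the sketch below is hypothetical (the field names loosely echo span-based annotation models, but are not taken from any of the tools):

```python
from dataclasses import dataclass
from typing import Optional

# Minimal common denominator of the tools' outputs: a text span plus a
# grounded concept ID with metadata. Field names are hypothetical.
@dataclass
class TextAnnotation:
    subject_start: int          # span start offset in the source text
    subject_end: int            # span end offset
    matched_text: str           # the surface string that was matched
    object_id: str              # grounded concept, e.g. a CURIE
    object_label: str           # preferred label of the concept
    confidence: Optional[float] = None  # not all annotators report one

# A wrapper around any of the annotators could emit records like:
ann = TextAnnotation(0, 5, "brain", "UBERON:0000955", "brain", 0.98)
```

Each tool's native output would then need only a thin adapter into this shape, which is the same argument made for GFF above.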
While the entity normalization step of NER could be seen as term matching, I think this is out of scope for SSSOM. However, I think it would make sense to have a SSSOM analog, where the SSSOM metadata element URIs are reused.
In fact I did a very quick and dirty first pass at this:
https://incatools.github.io/ontology-access-kit/datamodels/text-annotator/index.html https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/text_annotator.yaml
I think it would be useful to standardize on this, for applications like our https://github.com/monarch-initiative/ontorunner that wrap multiple different annotators for aggregating results, cc @hrshdhgd
cc @graybeal