cmungall opened this issue 2 years ago
Cool! Sounds interesting! No comment at the moment, but I think this looks useful.
Gilda is explicitly not an NER tool: it only does named entity normalization, meaning you already have a piece of text that represents a named entity, and it figures out a grounding for it. Unfortunately, this is a very common misconception. I am a bit confused about what you mean by this issue, since I think we have a different understanding of some of the vocabulary used here.
Update: we actually implemented NER in Gilda v1.0, which was released on June 30th, 2023.
Ah, that was just my misconception about Gilda. The goal here is to represent the full CR (concept recognition) step: NER plus grounding/normalization.
For these 3 reasons, I would not develop such an interest.
OA seems much more general
Practical use case: the use of a standard format for genome coordinates (GFF) has allowed lots of different datasets and browsers (e.g. JBrowse) to be combined. It would be nice to have something similar for text annotations, so that we could use the same markup (e.g. the nice spacy markup) with different annotators. It would also be nice if this were a modern JSON-based serialization, or a well-behaved TSV with a defined datamodel and a schema that can be used for validation, optionally with datamodel elements mapped to IRIs.
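To make the idea concrete, here is a minimal sketch of what one record in such a JSON-based serialization might look like. All field names are hypothetical illustrations, not a proposed standard:

```python
import json

# Hypothetical record for one text annotation; field names are
# illustrative only, loosely mirroring span-based formats like GFF.
annotation = {
    "document_id": "PMID:12345",       # assumed identifier scheme
    "subject_start": 10,               # span start offset in the text
    "subject_end": 17,
    "subject_text": "cerebro",
    "object_id": "UBERON:0000955",     # grounded ontology term (brain)
    "object_label": "brain",
    "annotator": "example-annotator",  # tool that produced the match
}

# A JSON-based serialization round-trips cleanly and is easy to validate
# against a schema:
serialized = json.dumps(annotation)
assert json.loads(serialized) == annotation
```

The point is less the exact fields than that the datamodel is explicit, so the same markup can be emitted by different annotators and validated uniformly.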
and to be clear: this is out of scope for SSSOM. I am exploring interest in an analog
I like the PubAnnotator JSON format for storing annotations, but it doesn't have a very rich way of representing what we know about a particular annotation. Something that connects it with SSSOM or with the Web Annotation Data Model would be nice, I think! I don't know how popular PubAnnotator JSON is, or how its community is organized, so it might be worth reaching out to them first to see if they'd be interested in something like this.
Our use case: we used SciGraph to annotate text, stored the result in one track in PubAnnotator, and then normalized those identifiers using the Translator Node Normalizer, which we stored in another track. Having some sort of layered labeling -- "the original text is 'cerebro', NER says that it is MESH:1234 'brain', NodeNorm says that this is UBERON:2345 'brain'" -- could be useful here.
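That layered labeling could be sketched roughly like this, with one entry per processing step over the same span (the track names and field names here are hypothetical, and the IDs are the placeholder ones from the example above):

```python
# Sketch of "layered labeling": the same matched span carries one entry
# per processing step, so the provenance of a grounding stays visible.
layers = [
    {"track": "ner",      "span": "cerebro", "id": "MESH:1234",   "label": "brain"},
    {"track": "nodenorm", "span": "cerebro", "id": "UBERON:2345", "label": "brain"},
]

# Downstream code can then trace how a grounding was produced:
for layer in layers:
    print(f"{layer['track']}: '{layer['span']}' -> {layer['id']} ({layer['label']})")
```

Each track is independently useful, and the chain from surface text to final normalized ID is never thrown away.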
While waiting on this, we went ahead and defined a "profile" of SSSOM for text annotator results:
https://w3id.org/oak/text-annotator
This is actively used in OAK
From @matentzn : https://mapping-commons.github.io/sssom/LiteralMapping/
There exist several models for annotations, e.g. Lemon and Open Annotation; see two state-of-the-art surveys in the Semantic Web journal: https://www.semantic-web-journal.net/system/files/swj1909.pdf and https://www.semantic-web-journal.net/system/files/swj2859.pdf. The main point is to link the process description to the output, in order to be able to compare the results. I would rather try to find a way to merge or translate between those existing models than create a new one.
@croussey thank you for reaching out!
I looked at https://www.w3.org/2019/09/lexicog/, for example, and https://www.w3.org/TR/annotation-model/, as you suggest. They, and others mentioned by the surveys you shared, seem like great resources.
My personal position (not speaking for anyone else) is that what we need to provide here is something much simpler than any of these models are designed for: a way to share SSSOM mappings where the subject is a literal. It is taking so much effort now to organise all these trainings, tutorials, presentations, etc. that it would be strange to say: use SSSOM for this one use case (entity mappings), but for this other, tiny-little different use case (entity mappings where the entity is a string), use this entirely different system.
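A rough sketch of what such a "mapping where the subject is a literal" row could look like in an SSSOM-style TSV. The column names below are approximations for illustration, not a verbatim copy of the SSSOM LiteralMapping schema:

```python
import csv
import io

# Approximate sketch of an SSSOM-style row whose subject is a literal
# string rather than an entity. Column names are illustrative only.
fieldnames = [
    "literal", "predicate_id", "object_id",
    "object_label", "mapping_justification",
]
row = {
    "literal": "cerebro",
    "predicate_id": "skos:exactMatch",
    "object_id": "UBERON:0000955",
    "object_label": "brain",
    "mapping_justification": "semapv:LexicalMatching",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

This reuses the familiar SSSOM predicate and justification vocabulary, which is exactly why treating it as a profile rather than a separate system is attractive.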
All SSSOM presentations start the same way: standardisation is a decentralised effort, and at all times a number of competing standards emerge. This is not bad. It is so much easier to convert one standardised format into another than to convert non-standardised data into a standardised form. We could write adapters for any of the systems you are linking to, and I am happy to accept a PR that explains when and why to prefer each of these over SSSOM - I really don't care whether people use one or the other, as long as they standardise their data and publish it according to Open FAIR Data principles.
What do you think? I am fine to be contradicted of course!
I share Clement's point of view. Annotation is not the same as mapping between entities. A text (full document) can be interpreted in different ways, and any annotator may have a specific point of view that justifies the annotation. Annotation is a process serving different purposes, so there are no right and wrong annotations; it depends on the use case. That is why I like the W3C OA model: in annotation, the main point is to describe the process, in order to understand why the annotation exists. An entity limited to a string is not a good entity because it is not interpretable; we need context to understand the entity (definition, or graph neighborhood, ...). Concerning mapping between entities, in my view, because entities should be described in a semantic resource that provides the context, we could say that there are right and wrong mappings.
Another point: I do not find the existing annotation models complex; the OA model is not complex. Thus I would not say that SSSOM is simpler than the others... as usual, it depends on how easily you can interpret the model. We use SSSOM for some entity mappings, and we had to reinterpret the model for our purposes in some examples. The paper will be available soon: https://www.frontiersin.org/articles/10.3389/frai.2023.1188036/abstract
Super exciting work! Cool!
An entity limited to a string is not a good entity because it is not interpretable; we need context to understand the entity (definition, or graph neighborhood, ...)
This is of course correct. A literal in this sense is not perceived as an entity, but more like a synonym in the thesaurus sense. And this (associating synonyms to entities) does not at all cover the OA scope - I think for a full model as Chris suggests up top, we indeed have to look at OA! But this is, as you say yourself, totally out of scope for SSSOM.
TBH, technically speaking you are probably right, and I feel bad arguing against you - there is a degree of "laziness" in the decision to add the SSSOM literal profile, or, to phrase it more positively, a lack of resources to familiarise ourselves with and integrate OA or something similar into our toolings, trainings, etc. I stand firm on the assertion, though, that we can integrate these after the fact.
Would you be willing to write a paragraph for the SSSOM docs pointing people that seek to publish "literal mappings" to look at OA first, and explain how they can distinguish broad/narrow/exact using it?
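For reference, a literal mapping expressed as a Web Annotation might look something like the sketch below, using the spec's TextQuoteSelector to anchor the matched string. The spec does not prescribe how to express the broad/narrow/exact distinction; attaching it as extra metadata on the body, as done here, is one assumed approach among several:

```python
import json

# One possible OA (Web Annotation Data Model) encoding of a literal
# mapping. TextQuoteSelector and the "identifying" purpose come from
# the spec; carrying the match strength on the body is an assumption.
anno = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "SpecificResource",
        "source": "http://purl.obolibrary.org/obo/UBERON_0000955",
        "purpose": "identifying",
    },
    "target": {
        "source": "http://example.org/doc1",  # hypothetical document IRI
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "cerebro",
        },
    },
}

# The annotation is plain JSON-LD and serializes directly:
assert json.loads(json.dumps(anno))["type"] == "Annotation"
```

A documentation paragraph could walk through exactly this kind of example and contrast it with the SSSOM literal profile, so readers can pick the right level of expressiveness.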
With pleasure. We will have another article from the French D2KAB project that will provide 3 examples of OA uses in an agronomical context. Could you let me know where to add the paragraph about OA documentation? Maybe I could write it in the git issue and let you copy-paste it where you want...
@croussey Please feel free to rework this file in the documentation to help people picking the right standard: https://github.com/mapping-commons/sssom/blob/master/src/docs/sssom-profiles.md
There are a number of different tools that perform NER on text, from BioPortal/Zooma through to scispacy and @cthoyt's Gilda (https://www.biorxiv.org/content/10.1101/2021.09.10.459803v1.full).
These all vary in their output but are some variant of text span location and ID plus metadata for the matched concept.
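That common denominator could be captured in a small shared datamodel; the sketch below is hypothetical (the field names loosely echo span-based annotation models, but are not taken from any of the tools):

```python
from dataclasses import dataclass
from typing import Optional

# Minimal common denominator of the tools' outputs: a text span plus a
# grounded concept ID with metadata. Field names are hypothetical.
@dataclass
class TextAnnotation:
    subject_start: int          # span start offset in the source text
    subject_end: int            # span end offset
    matched_text: str           # the surface string that was matched
    object_id: str              # grounded concept, e.g. a CURIE
    object_label: str           # preferred label of the concept
    confidence: Optional[float] = None  # not all annotators report one

# A wrapper around any of the annotators could emit records like:
ann = TextAnnotation(0, 5, "brain", "UBERON:0000955", "brain", 0.98)
```

Each tool's native output would then need only a thin adapter into this shape, which is the same argument made for GFF above.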
While the entity normalization step of NER could be seen as term matching, I think this is out of scope for SSSOM. However, I think it would make sense to have a SSSOM analog, where the SSSOM metadata element URIs are reused.
In fact I did a very quick and dirty first pass at this:
https://incatools.github.io/ontology-access-kit/datamodels/text-annotator/index.html https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/text_annotator.yaml
I think it would be useful to standardize on this, for applications like our https://github.com/monarch-initiative/ontorunner that wrap multiple different annotators for aggregating results, cc @hrshdhgd
cc @graybeal