ncbo / ncbo_annotator

Automatically processes a piece of text, annotates it with relevant ontology concepts, and returns the annotations.
http://bioportal.bioontology.org/annotator

hyphenated terms not found by annotator given non-hyphenated words #6

Open graybeal opened 5 years ago

graybeal commented 5 years ago

Recently, we used our lookup tool, which calls the Annotator, and noticed a discrepancy between how the Annotator searches for ontology class matches and how a manual search does. Basically, the Annotator does not seem to have the flexibility to deal with hyphens or similar characters, while a manual search in an ontology will find matches. The example uses an input term of “sodium iodide symporter”. MESH has a “sodium-iodide symporter”, but this is not found using the Annotator. Instead, the Annotator finds matches just to sodium iodide (see Excel attachment). Is this an issue of which you are already aware? If so, is there a plan for an Annotator version update, or would a fix be simple enough to implement in the near future?

graybeal commented 5 years ago

The reason the two processes are finding different strings is not so much (or not just) the hyphen, but that search also looks at the class description; the Annotator does not. The non-hyphenated string appears exactly in the description, and so search is finding that. (Search for "member 5 protein" and hit return to see the same result.)

So it is arguable whether this is a bug, a feature, or just a possible enhancement. I am pretty sure the mgrep method used by the annotator is quite strict about finding exact matches, which this is not. While we don't have any short-term plans for Annotator updates, we can look at whether a simple solution is available that would be smarter (or maybe, 'looser') about hyphens. Because we are using the mgrep algorithm, it may be either very easy or very time-consuming to update.
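To make the "looser about hyphens" idea concrete, here is a minimal sketch. It is not mgrep and not the Annotator's code; it only contrasts a strict whole-string dictionary lookup with one that normalizes hyphen-like characters on both the dictionary side and the input side. The dictionary entries and class IDs are placeholders made up for illustration.

```python
import re


def normalize(term: str) -> str:
    """Lowercase, replace hyphen-like characters with spaces, collapse whitespace."""
    term = re.sub(r"[-\u2010\u2011\u2012\u2013]", " ", term.lower())
    return " ".join(term.split())


# Hypothetical mini-dictionary; the IDs are placeholders, not real MeSH identifiers.
DICTIONARY = {
    "sodium-iodide symporter": "EXAMPLE:0001",
    "sodium iodide": "EXAMPLE:0002",
}

# Same dictionary, indexed by normalized label (built once, ahead of time).
NORMALIZED = {normalize(label): class_id for label, class_id in DICTIONARY.items()}


def strict_lookup(text: str):
    """Exact string match only, similar in spirit to a strict matcher."""
    return DICTIONARY.get(text)


def loose_lookup(text: str):
    """Match after hyphen normalization on both sides."""
    return NORMALIZED.get(normalize(text))


print(strict_lookup("sodium iodide symporter"))  # no exact entry for this string
print(loose_lookup("sodium iodide symporter"))   # matches the hyphenated label
```

The cost of this approach is that the dictionary has to be rebuilt with normalized labels, which may be where the "very easy or very time-consuming" question gets decided.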

jonquet commented 5 years ago

I know about this issue. Indeed, Mgrep is not flexible about matching: it is strict, which is why it does not catch the non-hyphenated version. The search service finds it not because the non-hyphenated version appears in a synonym, but because it relies on Lucene, which allows flexible matching.

If you search for "sodium iodide symporter" in PR, you will get a match even if this exact string is not a synonym. https://bioportal.bioontology.org/ontologies/PR?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPR_Q63008

Within SIFR, we partially fixed this by offering a "beta" version that uses a lemmatizer in the back end (and therefore a lemmatized dictionary too). This fixes plurals as well. A preliminary (not formal) study showed that the decrease in precision from lemmatization was not matched by the increase in recall. So it remains an option that users can turn on or off depending on whether they prefer precision or recall.
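A toy illustration of that trade-off, assuming a deliberately crude "lemmatizer" that only strips a trailing "s". This is not the SIFR lemmatizer, and the labels are invented; it only shows how lemmatizing both the dictionary and the input can recover a plural (recall gain) while also producing a spurious match (precision loss).

```python
def naive_lemma(token: str) -> str:
    """Toy rule: strip one trailing 's' from tokens longer than three characters."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def lemmatize(text: str) -> str:
    """Apply the toy rule to every whitespace-separated token."""
    return " ".join(naive_lemma(token) for token in text.lower().split())


# Hypothetical labels, chosen only to show both effects.
LABELS = ["symporter", "new"]
LEMMA_INDEX = {lemmatize(label): label for label in LABELS}


def match_exact(text: str):
    """Strict matching: the input must equal a label."""
    return text if text in LABELS else None


def match_lemma(text: str):
    """Lemmatized matching: compare lemmatized input to lemmatized labels."""
    return LEMMA_INDEX.get(lemmatize(text))


print(match_exact("symporters"), match_lemma("symporters"))  # plural recovered
print(match_exact("news"), match_lemma("news"))              # spurious match on "new"
```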

jonquet commented 5 years ago

As a side note, this also shows that MeSH's synonyms are not parsed properly by the Annotator (they are not included in the dictionary):

Try to annotate "Nis protein, rat", which is an altLabel in MeSH of the same class: https://bioportal.bioontology.org/ontologies/MESH?p=classes&conceptid=http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FMESH%2FC070626

It does not return any result. There is an issue in how MeSH's synonyms are defined. The metadata needs to be changed.

graybeal commented 5 years ago

In your last comment @jonquet, are you observing that an altLabel with an embedded comma is a problem? If not, can you be a little more explicit about exactly what the problem is? It seems to me that annotating any string containing a comma is likely to fail, because the comma will be treated as a phrase delimiter by the annotation algorithm.

jonquet commented 5 years ago

The problem is not the comma. Try to annotate "Hormones, Hormone Substitutes, and Hormone Antagonists" and you will get a match (via preferred name).

The problem described in this last comment is a synonym parsing issue: try to annotate "thyroid iodide transporter" with MeSH. You do not get any result, even though this expression is a synonym of "sodium-iodide symporter".

I believe the synonym property of MeSH is not well defined in the metadata. An admin needs to correct this.
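For anyone who wants to reproduce this check outside the web UI, here is a minimal sketch that builds a request to the BioPortal REST Annotator endpoint. It assumes the `text`, `ontologies`, and `apikey` query parameters and the `data.bioontology.org` base URL; `YOUR_API_KEY` is a placeholder, and the helper names are made up for this example. The response is assumed to be a JSON list of annotations, so an empty list would mean no match.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the BioPortal REST Annotator endpoint.
ANNOTATOR_URL = "https://data.bioontology.org/annotator"


def build_annotator_url(text, ontologies, api_key):
    """Build the GET URL for one Annotator call (no network access here)."""
    query = urllib.parse.urlencode(
        {
            "text": text,
            "ontologies": ",".join(ontologies),
            "apikey": api_key,
        }
    )
    return f"{ANNOTATOR_URL}?{query}"


def annotate(text, ontologies, api_key):
    """Call the Annotator and return the parsed JSON response."""
    url = build_annotator_url(text, ontologies, api_key)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Example usage (needs a valid API key and network access):
# results = annotate("thyroid iodide transporter", ["MESH"], "YOUR_API_KEY")
# print(len(results))
```

Running the same call with "sodium-iodide symporter" as the text gives a quick way to compare the preferred-name case against the synonym case.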