pick ontologies for NLP

diatomsRcool commented 5 years ago

Suggested ontologies for use by the NLP tool:

MONDO
HPO
LOINC
NCIT
CHEBI
GO
UBERON
FOODON
ENVO
NCBITAXON
OBA

diatomsRcool commented 5 years ago

Thoughts, @mellybelly?

diatomsRcool commented 5 years ago

from @kwuichet: Currently using CLAMP which extracts concepts and links them to UMLS CUIs. UMLS can then be linked to other libraries (SNOMED, LOINC, NCIt...). Other software can be explored if needed.

diatomsRcool commented 5 years ago

I just realized I put this issue in the wrong repo. :(

diatomsRcool commented 5 years ago

from Melissa: all of these plus ECTO and RXNORM

diatomsRcool commented 5 years ago

@kwuichet Any comment on these ontologies?

MONDO
HPO
LOINC
NCIT
CHEBI
GO
UBERON
FOODON
ENVO
NCBITAXON
OBA
ECTO
RXNORM

diatomsRcool commented 5 years ago

Hi Anne!

CLAMP currently works with the UMLS, but we are looking to filter the output to the most relevant/important vocabularies since UMLS includes so many highly specific vocabularies.

Here is the list: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html

From this I was thinking about NCIt, SNOMED, LOINC, MeSH, OMIM, HPO… But I know there are issues with SNOMED licensing.

Do these make sense to you? Too many? Too few? Are there others I should consider? We could also cast a wide net for all English vocabularies with level 0 restriction. I’d appreciate your input since you are the expert in this!

Best,

Kristin

mellybelly commented 5 years ago

Some initial thoughts:

Please only use openly licensed terminologies, else we run into redistribution issues. I like the list Anne put together based on the data we have examined together in the data harm group.
I would also use the terminologies natively rather than via UMLS mapping (but I am uncertain as to how CLAMP can be configured). Also many important ontologies from the OBO Foundry are not in the UMLS.
Why CLAMP? I have been told that the highest-performing freely available NER pipeline (including normalization) is currently OGER++ (https://github.com/OntoGene/OGER, with a web service at https://pub.cl.uzh.ch/projects/ontogene/oger/), which ought to be easy to deploy (but isn’t yet dockerized).
I would look at whether CLAMP can utilize the OLS API rather than UMLS: https://www.ebi.ac.uk/ols/docs/api
Perhaps we need to understand the requirements for the NER/NLP a bit better? Are these defined somewhere?
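
To make the OLS suggestion concrete, here is a minimal sketch of what a term lookup against OLS might look like. It only builds the request URL and makes no network call. The endpoint path (`/ols/api/search`) and the parameter names (`q`, `ontology`, `exact`, `rows`) are assumptions from memory and should be checked against the API docs linked above; `ols_search_url` is a hypothetical helper, and whether CLAMP can be pointed at such a service is still an open question.

```python
from urllib.parse import urlencode

# Assumed OLS search endpoint; verify against https://www.ebi.ac.uk/ols/docs/api
OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"


def ols_search_url(term, ontologies, exact=True, rows=10):
    """Build a search URL restricted to the given ontologies (hypothetical helper)."""
    params = {
        "q": term,
        # Assumption: OLS takes lowercase ontology IDs as a comma-separated list
        "ontology": ",".join(o.lower() for o in ontologies),
        "exact": "true" if exact else "false",
        "rows": rows,
    }
    return OLS_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Look up an extracted mention in two of the suggested ontologies
    print(ols_search_url("heart", ["UBERON", "MONDO"]))
```

Restricting the lookup to a short list of ontologies is the point here: it would let us normalize against the openly licensed terminologies directly instead of going through UMLS mappings.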

diatomsRcool commented 5 years ago

Are we locked in to CLAMP? There's no real deadline for this yet so we have some wiggle room. I'm doing some research into OGER now.

I don't know of any requirements documents for NER/NLP, but the data harm WG requirements are to reduce human annotation effort (for data sets in STAGE) and provide input for the knowledge graph query.

kwuichet commented 5 years ago

This started as an exploratory exercise to assess what level of assistance NLP could provide in this space. We did not seek to explore all NLP tools, on the understanding that, while the results of the various methods would differ somewhat, there would be a large overlap overall. So is the concern that CLAMP cannot currently provide what is needed, or that NLP cannot currently provide what is needed?

I still want to wrap up the CLAMP evaluation given that we already have that data, and I think it still provides a big picture of what NLP may (and may not) provide in this space, but I don't think we are locked into CLAMP. We happened to like the output a little bit better from CLAMP vs other software we tried. I can talk to my developer to see what kind of effort it would take to test OGER on a small data set if it seems warranted.

CLAMP and OGER are each utilizing a specific library: CLAMP uses UMLS whereas OGER uses Bio Term Hub (but also links to UMLS CUI when available). For the previously suggested list of vocabularies within these libraries here is the breakdown:

CLAMP and OGER: GO, NCBITAXON, RXNORM
CLAMP: HPO, LOINC, NCIt
OGER: CHEBI
Neither: MONDO, UBERON, FOODON, ENVO, OBA, ECTO
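
For anyone who wants to recheck this breakdown as the lists change, here is a small sketch that recomputes it with set membership. The vocabulary names are the ones from this thread; the `coverage` function and the variable names are illustrative only, and the CLAMP/OGER sets reflect only the breakdown given here, not an independent check of either tool.

```python
# Ontologies suggested in this thread, in the order they were listed
SUGGESTED = [
    "MONDO", "HPO", "LOINC", "NCIT", "CHEBI", "GO", "UBERON",
    "FOODON", "ENVO", "NCBITAXON", "OBA", "ECTO", "RXNORM",
]

# Coverage as reported in the breakdown above
CLAMP = {"GO", "NCBITAXON", "RXNORM", "HPO", "LOINC", "NCIT"}
OGER = {"GO", "NCBITAXON", "RXNORM", "CHEBI"}


def coverage(suggested, clamp, oger):
    """Group each suggested ontology by which tool's library includes it."""
    groups = {"both": [], "clamp_only": [], "oger_only": [], "neither": []}
    for name in suggested:
        in_clamp, in_oger = name in clamp, name in oger
        if in_clamp and in_oger:
            groups["both"].append(name)
        elif in_clamp:
            groups["clamp_only"].append(name)
        elif in_oger:
            groups["oger_only"].append(name)
        else:
            groups["neither"].append(name)
    return groups


if __name__ == "__main__":
    for group, names in coverage(SUGGESTED, CLAMP, OGER).items():
        print(f"{group}: {', '.join(names)}")
```

The "neither" group is the one that matters most for the decision, since those vocabularies would need custom configuration or separate mapping whichever tool is chosen.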

I think the vocabulary differences could become the biggest factor in the choice of NLP direction, and that's an area that can be explored. I don't know specifically how configurable the libraries are for other systems.

diatomsRcool commented 5 years ago

Thanks for the details. My favorability toward OGER comes from a discussion with a colleague who is an expert in biomedical NLP. He says that OGER has the highest performance at the moment. I would hate to go with a tool that has lower performance because there is a link to a desired vocabulary, especially if we can either reconfigure the vocabularies or use some of the tools we have in Monarch to make the desired links. Perhaps the best way to make the decision is:

1. You should finish the CLAMP evaluation.
2. We both list out our "hopes and dreams" for this NLP tool. And by hopes and dreams, I mean requirements. I'll give some thought to mine and add them here.
3. We use 1 and 2 to make a decision about CLAMP v OGER (or something else).

Thoughts?

kwuichet commented 5 years ago

Sounds good! I think we are on the same page. I'll be thinking about #2 as well.

mellybelly commented 5 years ago

Can we document requirements as part of this process? Easier to collect as we go ;-).

diatomsRcool commented 5 years ago

Let's collect requirements here.

helxplatform / dbgaptools

pick ontologies for NLP #5