ireneisdoomed / phenomena

Inspect what phenotypes are associated with a disease
https://ireneisdoomed-phenomena-app-pu8o79.streamlit.app/
MIT License

Normalise the extracted entities to an ontology #4

Open ireneisdoomed opened 1 year ago

ireneisdoomed commented 1 year ago

To extract the 1:M relationships between diseases and phenotypes, it would be useful to follow an ontology-based approach, grounding the extracted terms to common term definitions.

Using an ontology gives us 2 important benefits:

As for the ontologies we should use, I see 2 interesting possibilities, and I don't know which will yield better results:

The benefit of using EFO is that I am familiar with the dataset structure OT provides; the benefit of using MONDO/HP is that I have the impression their coverage and ontological representation might be slightly better for this task.

ireneisdoomed commented 1 year ago

In terms of vector stores, there are many we can use: FAISS (simple and local), LlamaIndex, ChromaDB or Pinecone (cloud-based). LangChain has wrappers around the main vector stores that expose common methods for similarity search and clustering, so I'd suggest implementing them with LangChain in case we want to change the library in the future.
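
To illustrate why going through a wrapper keeps the library swappable, here is a minimal sketch in plain Python. It does not use LangChain or any real vector store: the `VectorStore` interface, the `InMemoryStore` class and `nearest_terms` are hypothetical names made up for this example, and the brute-force cosine search only stands in for whatever FAISS, ChromaDB, etc. would do.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical minimal interface; this is NOT LangChain's actual API."""

    def add(self, term_id: str, vector: list[float]) -> None: ...

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]: ...


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Brute-force store, standing in for a real backend (assumption)."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, term_id: str, vector: list[float]) -> None:
        self._vectors[term_id] = vector

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        # Score every stored vector against the query, best match first.
        scored = [(tid, cosine(query, vec)) for tid, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def nearest_terms(store: VectorStore, query: list[float], k: int = 5) -> list[str]:
    """Application code depends only on the interface, not on the backend."""
    return [term_id for term_id, _ in store.search(query, k)]
```

Because `nearest_terms` only knows about `add` and `search`, replacing `InMemoryStore` with a class backed by another library would not touch the calling code, which is the same argument for using a wrapper layer.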

Regarding the embedding strategy, I'd use a high-quality embedding model here to make sure as much semantic value as possible is captured (OpenAI's models, basically).

ireneisdoomed commented 1 year ago

I did a very similar exercise a year ago with a "basic" BERT model tokenizer. The code can be seen here, for inspiration: https://github.com/ireneisdoomed/random_notebooks/tree/main/text_similarity

ireneisdoomed commented 1 year ago

We have implemented:

Todo: implement something like a confidence-based ranking system in the FE that shows suggested terms and lets the user choose from the top 5 most similar EFO IDs.
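
A rough sketch of what the ranking step could look like, assuming the similarity search already returns a score per candidate and that this score is used directly as the confidence. `Suggestion`, `rank_suggestions` and the `min_score` cut-off are hypothetical names for illustration, not something already in the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    term_id: str   # ontology ID of the candidate term
    label: str     # human-readable name shown to the user
    score: float   # similarity score, treated here as the confidence


def rank_suggestions(
    candidates: list[Suggestion],
    top_n: int = 5,
    min_score: float = 0.0,
) -> list[Suggestion]:
    """Return at most `top_n` candidates, highest score first.

    Candidates scoring below `min_score` are dropped, so the user is
    not offered terms we have little confidence in.
    """
    kept = [c for c in candidates if c.score >= min_score]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:top_n]
```

In the Streamlit app, the returned list could then be passed as the options of something like `st.selectbox`, with the label and score shown for each entry so the user can pick the right term.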