ireneisdoomed opened 1 year ago
In terms of vector stores there are many we could use: FAISS (simple and local), LlamaIndex, ChromaDB, or Pinecone (cloud-based). LangChain has wrappers around the main vector stores that expose common methods for similarity search and clustering, so I'd suggest implementing them through LangChain in case we want to change the library in the future.
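The swap-ability argument can be sketched in plain Python (hypothetical class and method names chosen to mirror the `similarity_search` style that LangChain's wrappers expose; this is an illustration of the interface idea, not LangChain's actual implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class InMemoryStore:
    """Toy stand-in for a FAISS/Chroma/Pinecone backend behind one interface.

    The point: as long as every backend implements `add` and
    `similarity_search`, the calling code never changes when we
    swap the underlying library.
    """

    def __init__(self):
        self._docs = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._docs.append((text, vector))

    def similarity_search(self, query_vector, k=1):
        ranked = sorted(
            self._docs,
            key=lambda doc: cosine(doc[1], query_vector),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

store = InMemoryStore()
store.add("asthma", [1.0, 0.1])     # toy 2-d embeddings for illustration
store.add("diabetes", [0.0, 1.0])
nearest = store.similarity_search([0.9, 0.2], k=1)
```

If we later move from FAISS to Pinecone, only the class behind the interface changes, not the calling code.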
Regarding the embedding strategy, I'd use a high-quality embedding model here to make sure that as much semantic value as possible is captured (OpenAI's models, basically).
I did a very similar exercise a year ago with a "basic" BERT model tokenizer. The code can be seen here for inspiration: https://github.com/ireneisdoomed/random_notebooks/tree/main/text_similarity
We have implemented:
Todo: implement in the FE a confidence-based ranking system that shows suggested terms and lets the user choose from the top 5 most similar EFO IDs.
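One way to sketch that ranking (toy 2-d embeddings and illustrative EFO IDs; in practice the scores would come from the vector store's similarity search over real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings for a handful of EFO terms (toy values,
# not real model output).
efo_embeddings = {
    "EFO:0000270": [0.9, 0.1],
    "EFO:0000400": [0.1, 0.9],
    "EFO:0003106": [0.8, 0.3],
    "EFO:0000612": [0.2, 0.8],
    "EFO:0004190": [0.7, 0.5],
    "EFO:0005856": [0.5, 0.5],
}

def suggest_terms(query_vec, embeddings, k=5):
    """Return the top-k most similar EFO IDs with their confidence scores."""
    scored = [
        (term, cosine_similarity(vec, query_vec))
        for term, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# The FE would render these five (ID, score) pairs for the user to pick from.
suggestions = suggest_terms([1.0, 0.2], efo_embeddings, k=5)
```

Exposing the score alongside each ID is what makes the ranking "confidence-based": the user sees not just the candidates but how close each one is.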
To be able to extract the 1:M relationships between diseases and phenotypes, it would be useful to follow an ontology-based approach, grounding the identified terms to common term definitions.
Using an ontology gives us 2 important benefits:
As for the ontologies we should be using, I see 2 interesting possibilities, and I don't know which will yield better results:
Using the Experimental Factor Ontology (EFO): a) I'd define as a phenotype every child of these terms:
Using MONDO and the Human Phenotype ontology
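Either way, the "every child of these terms" selection boils down to a descendant traversal over the ontology's parent→child edges. A minimal sketch, assuming a toy edge map with made-up IDs (not real EFO or MONDO structure):

```python
from collections import deque

# Hypothetical parent -> children edges, standing in for the real
# ontology graph we'd load from EFO or MONDO/HP.
children = {
    "EFO:phenotype_root": ["EFO:A", "EFO:B"],
    "EFO:A": ["EFO:C"],
    "EFO:B": [],
    "EFO:C": [],
}

def descendants(term, edges):
    """Collect all descendants of `term` via breadth-first traversal."""
    seen = set()
    queue = deque(edges.get(term, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(edges.get(node, []))
    return seen

# Every descendant of the chosen root terms would be treated as a phenotype.
phenotype_terms = descendants("EFO:phenotype_root", children)
```

The `seen` set guards against revisiting nodes, which matters because real ontologies are DAGs where a term can have multiple parents.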
The benefit of using EFO is that I am familiar with the dataset structure OT provides; the benefit of using MONDO/HP is that I have the impression that their coverage and ontological representation might be slightly better for this task.