serratus-bio / open-virome

monorepo for data explorer UI and APIs
http://openvirome.com/
GNU Affero General Public License v3.0

Virome recommendation system using graph embeddings #128

Open lukepereira opened 1 week ago

lukepereira commented 1 week ago

Overview

This analysis can later be used to build a recommendation system, for both the OpenVirome app and the paper, that suggests indirectly related viromes. The embeddings can also be used to create unsupervised virome clusters that support "Global text search" functionality in a GraphRAG/LLM service.

The high-level idea is to use the metadata as initial low-dimensional, generic features and train higher-dimensional embeddings that are more specific to the "virome" co-occurrence topology we're interested in. These embeddings should capture a distance that reflects both direct metadata similarity and indirect similarity in virome composition, which is useful for recommendation systems and RAG.
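To make the idea concrete, here is a minimal sketch of a single neighbor-aggregation step over a co-occurrence graph, in plain Python. It is untrained: a real GraphSAGE-style layer would add learned weights and a nonlinearity, and would be fit with a link-prediction or similar objective. The sOTU names, feature values, and edges are all hypothetical toy data, not taken from OpenVirome.

```python
# Hypothetical toy data: names and values are illustrative only.
# Metadata-derived feature vectors per sOTU (the initial generic features).
features = {
    "sotu_a": [1.0, 0.0],
    "sotu_b": [0.0, 1.0],
    "sotu_c": [1.0, 1.0],
}

# Undirected co-occurrence edges: two sOTUs observed in the same sample.
edges = [("sotu_a", "sotu_b"), ("sotu_b", "sotu_c")]


def build_neighbors(edges):
    """Turn an undirected edge list into an adjacency mapping."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def aggregate(features, neighbors):
    """One untrained aggregation step: concatenate each node's own
    features with the mean of its neighbors' features."""
    out = {}
    for node, own in features.items():
        nbrs = sorted(neighbors.get(node, ()))
        if nbrs:
            mean = [
                sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(own))
            ]
        else:
            # Isolated node: no topology signal to add.
            mean = [0.0] * len(own)
        out[node] = own + mean
    return out


embeddings = aggregate(features, build_neighbors(edges))
```

After this step each vector is twice as long as the input features: the first half is the sOTU's own metadata and the second half summarizes what it co-occurs with. Stacking more steps would pull in information from sOTUs further away in the graph.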

Background / Context

From past work, we have generated sufficient metadata features to create graph embeddings. This included entity resolution, by clustering taxonomy and tissue labels using an ontology, and metadata imputation, using feature propagation on a phylogenetic network.

IMO it makes sense to start testing a final application with the data we have to see how it performs on more complex tasks than direct lookups. After this, we can go back and make improvements to the metadata imputation and entity resolution steps.

Hypothesis

Using metadata features and co-occurrence topology, we can create embeddings of sOTU viromes that capture biologically meaningful distance (described below).

Experiment

Controls

Positive control: sOTUs that commonly co-occur have low cosine distance in embedding space

Negative control: sOTUs that rarely co-occur have high cosine distance in embedding space
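One way the two controls could be checked is to compare the mean cosine distance of frequently co-occurring pairs against that of rarely co-occurring pairs. The sketch below assumes embeddings are available as plain lists of floats; the embedding values and both pair lists are hypothetical, and how "commonly" and "rarely" get thresholded is left open.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def mean_distance(embeddings, pairs):
    """Average cosine distance over a list of (sOTU, sOTU) pairs."""
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical embeddings and pair lists, for illustration only.
embeddings = {
    "sotu_a": [1.0, 0.1],
    "sotu_b": [0.9, 0.2],
    "sotu_c": [0.0, 1.0],
}
frequent_pairs = [("sotu_a", "sotu_b")]  # positive control
rare_pairs = [("sotu_a", "sotu_c"), ("sotu_b", "sotu_c")]  # negative control

positive = mean_distance(embeddings, frequent_pairs)
negative = mean_distance(embeddings, rare_pairs)

# The controls hold if co-occurring pairs sit closer together on average.
controls_pass = positive < negative
```

On real data a comparison of the two distance distributions (rather than only their means) would probably be more informative, but the pass/fail logic stays the same.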

Expected Outcome

When analyzing the Kadipiro embedding cluster, we find interesting and relevant sOTUs and BioProjects that could not have been discovered from direct metadata matches or BLAST searches
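A query of this kind could be sketched as a nearest-neighbor lookup that skips anything already reachable by a direct metadata match. The function name `indirect_neighbors`, the label sets, and the vectors below are assumptions for illustration; the sketch also assumes all embedding vectors are non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def indirect_neighbors(query, embeddings, labels, k=2):
    """Rank other sOTUs by cosine distance to `query`, skipping any that
    already share a metadata label with it (a direct match)."""
    candidates = [
        (cosine_distance(embeddings[query], vec), name)
        for name, vec in embeddings.items()
        if name != query and not (labels[name] & labels[query])
    ]
    return [name for _, name in sorted(candidates)[:k]]


# Hypothetical embeddings and metadata label sets.
embeddings = {
    "query": [1.0, 0.0],
    "direct_match": [0.9, 0.1],
    "indirect_hit": [0.8, 0.3],
    "unrelated": [0.0, 1.0],
}
labels = {
    "query": {"host_x"},
    "direct_match": {"host_x"},
    "indirect_hit": {"host_y"},
    "unrelated": {"host_z"},
}

result = indirect_neighbors("query", embeddings, labels)
```

Here `direct_match` is dropped because it shares a label with the query, so the closest remaining neighbor is the one that only the embedding could surface. The same filter could be extended to exclude BLAST hits if those are available as a set per sOTU.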

Open Questions

References

Link prediction:

Heterogeneous Recommendation system:

SAGEConv inductive representation learning

almosnow commented 1 week ago