Virome recommendation system using graph embeddings

Overview

This analysis can later be applied to build a recommendation system in the OpenVirome app and paper for suggesting indirectly related viromes. The embeddings can also be used to create unsupervised virome clusters to support "Global text search" functionality in a GraphRAG/LLM service.

The high-level idea is to use the metadata as initial low-dimensional, generic features and train higher-dimensional embeddings that are more specific to the "virome" co-occurrence topology we're interested in. These embeddings should capture a distance that reflects both direct metadata similarity and indirect similarity in virome composition, which is useful for recommendation systems and RAG.

Background / Context

From past work, we have generated sufficient metadata features to use for creating graph embeddings. This included entity resolution by clustering taxonomy and tissue labels using an ontology, and metadata imputation using feature propagation on a phylogenetic network.

IMO it makes sense to start testing a final application with the data we have to see how it performs on more complex tasks than direct lookups. After this, we can go back and make improvements to the metadata imputation and entity resolution steps.

Hypothesis

Using metadata features and co-occurrence topology, we can create embeddings of sOTU viromes that capture biologically meaningful distance (described below).

Experiment

[ ] Create an sOTU-sOTU network based on whether two sOTUs have co-occurred in the same SRA run.
- Edges can be weighted by the number of times they have co-occurred.
[ ] Split network into disjoint train and test datasets.
- Using disjoint sets prevents information leakage and is typical for training inductive GNNs
- Some care will need to be taken to ensure the train and test data have a similar distribution of viromes
[ ] Train an inductive model using a link prediction pipeline to predict masked co-occurrence edges between sOTUs.
- We can consider this a self-supervised pretraining step. It is similar to masking tokens in protein language models, but in our case we are masking edges in a co-occurrence network.
- During training, the model performs message passing of features along unmasked co-occurrence edges to build embeddings
- The embeddings are updated through supervision based on the accuracy of predicting masked co-occurrence edges
[ ] Evaluate the inductive model on the unseen test data using a similar masking and message passing process
- The standard link prediction metric is AUC-PR (area under the precision-recall curve)
- Similar metrics exist for top k predictions (ref)
[ ] Evaluate embeddings of a specific KNN cluster (Eimeria and Kadipiro)
- Aside from evaluating the pre-training task, we can inspect whether the final embedding model is producing meaningful distances
- Precision/Recall @ K for direct matches
- Find sOTUs that are not direct matches to the target query (Eimeria or Kadipiro) and manually review whether each recommendation is sensible
- Measure average metadata similarity within the KNN cluster (cosine distance or Jaccard index)
[ ] Future work
- GraphRAG: Create unsupervised clusters on embeddings, use LLM to create summaries of metadata and bioprojects
- Recommendation system: Given an input virome query, recommend a similar virome that isn't already included in direct matches of the input
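
The first two steps of the experiment (building the weighted co-occurrence network and splitting it into disjoint sets) can be sketched with the standard library alone. This is a minimal sketch, assuming the input is a list of (run, sOTU) pairs. The run and sOTU names are made up, the function names are hypothetical, and splitting by connected component is only one possible way to get disjoint sets, not necessarily what the final pipeline will do.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_edges(run_sotu_pairs):
    """Count, for each unordered sOTU pair, the number of runs containing both."""
    sotus_by_run = defaultdict(set)
    for run, sotu in run_sotu_pairs:
        sotus_by_run[run].add(sotu)
    weights = Counter()
    for sotus in sotus_by_run.values():
        for a, b in combinations(sorted(sotus), 2):
            weights[(a, b)] += 1
    return weights


def connected_components(edges):
    """Return the connected components of the graph as a list of node sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def split_disjoint(edges, test_fraction=0.3):
    """Move whole components into test while test stays within test_fraction of nodes."""
    components = sorted(connected_components(edges), key=len)
    total = sum(len(c) for c in components)
    test_nodes = set()
    for component in components:
        if len(test_nodes) + len(component) <= test_fraction * total:
            test_nodes |= component
    # Both endpoints of an edge lie in the same component,
    # so checking one endpoint is enough.
    train = {e: w for e, w in edges.items() if e[0] not in test_nodes}
    test = {e: w for e, w in edges.items() if e[0] in test_nodes}
    return train, test


# Toy input: (run, sOTU) pairs with hypothetical identifiers.
pairs = [
    ("run_1", "u1"), ("run_1", "u2"),
    ("run_2", "u1"), ("run_2", "u2"), ("run_2", "u3"),
    ("run_3", "u4"), ("run_3", "u5"),
]
edges = build_cooccurrence_edges(pairs)
train, test = split_disjoint(edges, test_fraction=0.5)
```

On real data the network may turn out to be one large component, in which case a component-based split cannot work and some edges would have to be cut or dropped to separate train from test.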

Controls

Positive control: sOTUs that commonly co-occur have low cosine distance in embedding space

Negative control: sOTUs that rarely co-occur have high cosine distance in embedding space
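
Both controls can be checked with a few lines once embeddings exist. The sketch below assumes embeddings are stored as a dict from sOTU name to a list of floats; the 2-d vectors and the pair lists are toy values chosen for illustration, not model output.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pair_distance(embeddings, pairs):
    """Average cosine distance over a list of sOTU pairs."""
    distances = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
    return sum(distances) / len(distances)


# Toy 2-d embeddings (hypothetical values).
embeddings = {
    "u1": [1.0, 0.0],
    "u2": [0.9, 0.1],
    "u3": [0.0, 1.0],
}
frequent_pairs = [("u1", "u2")]  # positive control: commonly co-occur
rare_pairs = [("u1", "u3")]      # negative control: rarely co-occur

positive = mean_pair_distance(embeddings, frequent_pairs)
negative = mean_pair_distance(embeddings, rare_pairs)
controls_pass = positive < negative
```

How "commonly" and "rarely" are thresholded on the edge weights is left open here and would need to be decided before running the controls.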

Expected Outcome

When analyzing the Kadipiro embedding cluster, we find interesting and relevant sOTUs and BioProjects that could not have been discovered from direct metadata matches or BLAST searches
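
One way to make this outcome measurable is to split the nearest neighbours of a query into direct matches (scored with precision/recall @ K) and the remainder, which goes to manual review. The sketch below assumes a ranked neighbour list and a set of known direct matches are already available; all names and values are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the top-k ranked items against a relevant set."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def indirect_candidates(ranked, direct_matches, k):
    """Top-k neighbours that are not direct matches, kept in rank order."""
    return [item for item in ranked[:k] if item not in direct_matches]


# Toy neighbour ranking for one query (hypothetical sOTU names).
ranked_neighbours = ["u2", "u7", "u3", "u9", "u4"]
direct_matches = {"u2", "u3", "u4", "u5"}

precision, recall = precision_recall_at_k(ranked_neighbours, direct_matches, k=4)
to_review = indirect_candidates(ranked_neighbours, direct_matches, k=4)
```

In this toy case two of the top four neighbours are direct matches, and the other two (`u7`, `u9`) are the candidates that would be reviewed by hand.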

Open Questions

Will the difference/overlap between metadata feature distributions and co-occurrence be sufficient for learning embeddings?
(Later) implementation details of recommendation system given a list of sOTUs
- We can use a MIPS (maximum inner product search) KNN index for fast sOTU recommendations, but may want to use graph embeddings for recommending Runs or BioProjects
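
As a rough illustration of what such an index computes, here is a brute-force inner-product top-k search over sOTU embeddings. It is a sketch under assumptions: the list of query sOTUs is averaged into one query vector (one simple choice among several), the `recommend` function and its `exclude` parameter are hypothetical, and a real deployment would use a dedicated index instead of scanning every vector.

```python
import heapq


def recommend(embeddings, query_sotus, k=3, exclude=()):
    """Return the k sOTUs with the largest inner product to the mean query vector."""
    dim = len(next(iter(embeddings.values())))
    # Average the query embeddings into a single query vector.
    query_vec = [
        sum(embeddings[s][i] for s in query_sotus) / len(query_sotus)
        for i in range(dim)
    ]
    # Never recommend the query itself or anything listed in exclude
    # (for example, direct matches of the input).
    skip = set(query_sotus) | set(exclude)
    scored = (
        (sum(q * x for q, x in zip(query_vec, vec)), sotu)
        for sotu, vec in embeddings.items()
        if sotu not in skip
    )
    return [sotu for _, sotu in heapq.nlargest(k, scored)]


# Toy 2-d embeddings (hypothetical values).
embeddings = {
    "u1": [1.0, 0.0],
    "u2": [0.8, 0.2],
    "u3": [0.0, 1.0],
    "u4": [0.6, 0.5],
    "u5": [-1.0, 0.0],
}
top = recommend(embeddings, ["u1"], k=2)
top_indirect = recommend(embeddings, ["u1"], k=2, exclude={"u2"})
```

Passing the direct matches through `exclude` gives the "similar but not already a direct match" behaviour described under future work.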

References

Link prediction:

https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/link-prediction/

Heterogeneous Recommendation system:

https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/recommender_system.py

SAGEConv (GraphSAGE) inductive representation learning:

https://arxiv.org/abs/1706.02216

serratus-bio / open-virome