Overview
This analysis can later be applied to build a recommendation system in the OpenVirome app and paper for suggesting indirectly related viromes. The embeddings can also be used to create unsupervised virome clusters to support "Global text search" functionality in a GraphRAG/LLM service.
The high-level idea for creating embeddings is to use the metadata as initial low-dimensional generic features to train higher-dimensional embeddings that are more specific to the "virome" co-occurrence topology we're interested in. These embeddings should capture a distance that considers both direct metadata similarity and indirect similarity related to virome composition, which is useful for recommendation systems and RAG.
Background / Context
From past work, we have generated sufficient metadata features to use for creating graph embeddings. This included entity resolution by clustering taxonomy and tissue labels using an ontology, and metadata imputation using feature propagation on a phylogenetic network.
IMO it makes sense to start testing a final application with the data we have to see how it performs on more complex tasks than direct lookups. After this, we can go back and make improvements to the metadata imputation and entity resolution steps.
Hypothesis
Using metadata features and co-occurrence topology, we can create embeddings of sOTU viromes that capture biologically meaningful distance (described below).
Experiment
[ ] Create a sOTU-sOTU network based on whether two sOTUs have co-occurred in the same SRA run.
Edges can be weighted by the number of times they have co-occurred.
[ ] Split network into disjoint train and test datasets.
Using disjoint sets prevents the possibility of information leakage and is typical for training inductive GNNs
Some care will need to be taken to ensure train and test data have a similar distribution of viromes
[ ] Train an inductive model using a link prediction pipeline to predict masked co-occurrence edges between sOTUs.
We can consider this as a self-supervised pretraining step. This is similar to masking tokens in protein language models, but in our case we are masking edges in a co-occurrence network.
During training, the model performs message passing of features along unmasked co-occurrence edges to build embeddings
The embeddings are updated through supervision based on the accuracy of predicting masked co-occurrence edges
[ ] Evaluate the inductive model on the unseen test data using a similar masking and message passing process
A standard link prediction evaluation metric is AUC-PR (area under the precision-recall curve)
[ ] Evaluate embeddings of a specific KNN cluster (Eimeria and Kadipiro)
Aside from evaluating the pre-training task, we can inspect whether the final embedding model is producing meaningful distances
Precision/Recall @ K for direct matches
Find sOTUs that are not direct matches to target query (Eimeria or Kadipiro) and manually review if recommendation is sensible
Measure average metadata similarity within the KNN cluster (cosine similarity or Jaccard index)
[ ] Future work
GraphRAG: Create unsupervised clusters on embeddings, use LLM to create summaries of metadata and bioprojects
Recommendation system: Given an input virome query, recommend a similar virome that isn't already included in direct matches of the input
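As a rough illustration of the first experiment step, the weighted sOTU-sOTU co-occurrence network could be built along these lines. This is only a sketch: the function name `build_cooccurrence_edges`, the run-to-sOTU mapping, and the toy identifiers are assumptions made for the example, not the project's actual data model.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(run_to_sotus):
    """Count how often each unordered sOTU pair appears in the same run."""
    edge_weights = Counter()
    for sotus in run_to_sotus.values():
        # Deduplicate and sort so every unordered pair has one canonical key.
        for a, b in combinations(sorted(set(sotus)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights


# Hypothetical toy input: run accession -> sOTUs detected in that run.
runs = {
    "run_1": ["sotu_a", "sotu_b", "sotu_c"],
    "run_2": ["sotu_a", "sotu_b"],
    "run_3": ["sotu_c", "sotu_d"],
}

edges = build_cooccurrence_edges(runs)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting pair counts can serve directly as edge weights. Whether raw counts need normalizing (for example, to dampen very frequently sampled sOTUs) is left open here.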
Controls
Positive control: sOTUs that commonly co-occur have low cosine distance in embedding space
Negative control: sOTUs that rarely co-occur have high cosine distance in embedding space
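One way the two controls could be checked is to compare the mean cosine distance over frequently co-occurring pairs against the mean over rarely co-occurring pairs. The sketch below uses made-up two-dimensional vectors purely to show the calculation. The names `cosine_distance` and `mean_pair_distance`, the pair lists, and the embedding values are all hypothetical placeholders, not results.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pair_distance(pairs, embeddings):
    """Average cosine distance over a list of (sOTU, sOTU) pairs."""
    distances = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
    return sum(distances) / len(distances)


# Toy placeholder embeddings (not real model output).
embeddings = {
    "sotu_a": [1.0, 0.0],
    "sotu_b": [0.9, 0.1],
    "sotu_c": [0.0, 1.0],
}
frequent_pairs = [("sotu_a", "sotu_b")]  # assumed to co-occur often
rare_pairs = [("sotu_a", "sotu_c")]      # assumed to co-occur rarely

positive = mean_pair_distance(frequent_pairs, embeddings)
negative = mean_pair_distance(rare_pairs, embeddings)
print(f"positive control: {positive:.3f}, negative control: {negative:.3f}")
```

If the embeddings behave as hoped, the positive-control mean should sit clearly below the negative-control mean. The threshold for "commonly" versus "rarely" co-occurring still needs to be chosen.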
Expected Outcome
When analyzing the Kadipiro embedding cluster, we find interesting and relevant sOTUs and BioProjects that could not have been discovered from direct metadata matches or BLAST searches
Open Questions
Will the difference/overlap between metadata feature distributions and co-occurrence be sufficient for learning embeddings?
(Later) implementation details of recommendation system given a list of sOTUs
We can use a MIPSKNN index for fast sOTU recommendations, but may want to use graph embeddings for recommending Runs or BioProjects
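To make the recommendation idea concrete, here is a brute-force stand-in for a maximum inner product search over sOTU embeddings that skips sOTUs already returned as direct matches. It is a sketch under stated assumptions: the `recommend` function, its parameters, and the toy embeddings are hypothetical, and an actual index would replace the linear scan.

```python
def recommend(query_vec, embeddings, exclude, k):
    """Return the k sOTUs with the highest inner product to the query,
    skipping anything in `exclude` (e.g. the query and its direct matches)."""
    scored = []
    for sotu, vec in embeddings.items():
        if sotu in exclude:
            continue
        score = sum(q * x for q, x in zip(query_vec, vec))
        scored.append((score, sotu))
    # Highest score first; break ties by name for deterministic output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sotu for _, sotu in scored[:k]]


# Toy placeholder embeddings (not real model output).
embeddings = {
    "sotu_a": [1.0, 0.0],
    "sotu_b": [0.8, 0.2],
    "sotu_c": [0.1, 0.9],
    "sotu_d": [0.6, 0.4],
}

# Query with sotu_a; treat sotu_a and sotu_b as already-known direct matches.
top = recommend(embeddings["sotu_a"], embeddings, exclude={"sotu_a", "sotu_b"}, k=2)
print(top)
```

Filtering direct matches before ranking keeps the output focused on indirectly related sOTUs, which is the case the recommendation system is meant to surface.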