Precomputed embeddings - Githubissues

bschilder commented 1 year ago

Hello,

Thanks again for the awesome framework!

Our lab is working on a project that involves cross-species comparisons, and to start I'm looking to use SATURN to identify homologous cell type mappings across datasets/species.

I noticed in the following file that your team seems to have already precomputed gene embeddings for quite a few of the species that I'm investigating. I have already begun regenerating these embeddings, but as you know this can take quite a while (especially across many species).

Would your team possibly be willing to share these embeddings so that users like myself can skip the embedding pre-steps and go right to training SATURN? Figshare is great, but any storage platform would be welcome. https://github.com/snap-stanford/SATURN/blob/f0813fb5300a3ada69415dc9141c59e8bc4e5cb6/data/gene_embeddings.py#L14

While sharing all of them would be welcome, the ones that are highest priority for me are:

        'human': EMBEDDING_DIR / 'Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt',
        'mouse': EMBEDDING_DIR / 'Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt',
        'frog': FZ_EMBEDDING_DIR / 'Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM1b.pt',
        'zebrafish': FZ_EMBEDDING_DIR / 'Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM1b.pt', 
        "mouse_lemur": FZ_EMBEDDING_DIR / "Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM1b.pt", 
        "fly": FZ_EMBEDDING_DIR / 'Drosophila_melanogaster.BDGP6.32.gene_symbol_to_embedding_ESM1b.pt',
        "pig": FZ_EMBEDDING_DIR / 'Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM1b.pt',
        "macaca_fascicularis": FZ_EMBEDDING_DIR / 'Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM1b.pt', 
        "rat": FZ_EMBEDDING_DIR / 'Rattus_norvegicus.mRatBN7.2.gene_symbol_to_embedding_ESM1b.pt'

Thanks again! Brian

Yanay1 commented 1 year ago

Hi Brian, please see this section of the readme: https://github.com/snap-stanford/SATURN#data-availability

We made protein embeddings available for all species analyzed in the paper here: http://snap.stanford.edu/saturn/data/

bschilder commented 1 year ago

Thanks @Yanay1 , and apologies for having missed that earlier. The protein_embeddings.tar.gz does indeed contain the majority of the species I mentioned.

That said, a few still seem to be missing. Of those, I'm particularly interested in "rat" and "fly". I don't suppose you'd be able to share those as well?

        'bat':  FZ_EMBEDDING_DIR / 'Rhinolophus_ferrumequinum.mRhiFer1_v1.gene_symbol_to_embedding_ESM1b.pt', 
        "sea_squirt": FZ_EMBEDDING_DIR / 'Ciona_intestinalis.KH.gene_symbol_to_embedding_ESM1b.pt',
        "chicken": FZ_EMBEDDING_DIR / 'Gallus_gallus.GRCg6a.gene_symbol_to_embedding_ESM1b.pt',
        "fly": FZ_EMBEDDING_DIR / 'Drosophila_melanogaster.BDGP6.32.gene_symbol_to_embedding_ESM1b.pt', 
        "rat": FZ_EMBEDDING_DIR / 'Rattus_norvegicus.mRatBN7.2.gene_symbol_to_embedding_ESM1b.pt',
        "tree_shrew": FZ_EMBEDDING_DIR / 'Tupaia_belangeri.TREESHREW.gene_symbol_to_embedding_ESM1b.pt'

Yanay1 commented 1 year ago

Shared via email!

bschilder commented 1 year ago

Perfect, thanks @Yanay1 !

dana-mcc commented 1 year ago

I'm struggling a bit with generating the embeddings for Callithrix jacchus, and I was wondering whether precomputed embeddings exist for this species. Also, can you explain how to use .torch files? The frog_zebrafish_embryogenesis vignette starts with a .pt file.

Yanay1 commented 1 year ago

Hi,

I can generate the Marmoset protein embeddings for you! It should be done in the next few days.

Torch files store torch tensors; in this case, we use them to store the protein embeddings. They can be read in using

torch.load(...path)

https://pytorch.org/docs/stable/generated/torch.load.html
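
A minimal sketch of reading and inspecting one of these files. It assumes, based only on the `gene_symbol_to_embedding` file names earlier in this thread, that each `.pt` file holds a dictionary mapping gene symbols to embedding tensors. Both helper names and the path in the usage comment are illustrative, not part of SATURN.

```python
def load_gene_embeddings(path):
    """Load a .pt file onto the CPU (torch is imported lazily here)."""
    import torch
    return torch.load(path, map_location="cpu")


def summarize_embeddings(embeddings, n_preview=3):
    """Return (gene count, a few gene symbols, shape of the first embedding).

    Assumes `embeddings` is a dict of gene symbol -> object with a `.shape`,
    which is an assumption about the file layout, not a documented format.
    """
    genes = list(embeddings)
    first_shape = tuple(embeddings[genes[0]].shape) if genes else None
    return len(genes), genes[:n_preview], first_shape


# Usage (the path is a placeholder for a file you have downloaded):
# emb = load_gene_embeddings("path/to/species.gene_symbol_to_embedding_ESM1b.pt")
# print(summarize_embeddings(emb))
```

Passing `map_location="cpu"` lets the file be opened on a machine without a GPU, even if the tensors were saved from one.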

dana-mcc commented 1 year ago

Wow, thank you so much! I really appreciate your help.

Best, Dana McCormack Senior Research Support Associate Feng Lab, MIT Pronouns: they/them

Yanay1 commented 1 year ago

ESM1b Embeddings: https://drive.google.com/file/d/19qsKrFO153EId4uP7Ip4YYtLbEDHUoKF/view?usp=drive_link

ESM2 Embeddings: https://drive.google.com/file/d/1ItTBC27WQ968gkGVwdjzgMMrt2mHdTsb/view?usp=sharing

The proteome comes from here: https://useast.ensembl.org/Callithrix_jacchus/Info/Index

You should download the whole thing. I'm not sure why Google Drive shows it as a folder; it's one file.

dana-mcc commented 1 year ago

Thank you so much!!!

Best, Dana McCormack Senior Research Support Associate Feng Lab, MIT Pronouns: they/them

dana-mcc commented 1 year ago

Hi,

I have some follow up questions about using SATURN:

1. The adata objects listed in the output of train-saturn.py have fewer genes than the original objects. For example, our marmoset data has 25571 genes originally and has 13784 genes when it's passed in train-saturn. Is that just reflective of the genes that are able to be mapped through Ensembl and ESM?
2. How did you determine the default number of hv_genes and num_macrogenes? Was there a sweet spot of genes that produced the best embedding? In my experience with other models, I've noticed that the number of highly variable genes and specifically which genes are entered into the model can drastically affect the results. Are there circumstances where hv_genes and num_macrogenes might work better with higher/lower numbers?
3. How is the number of genes assigned to a macrogene determined? Is there a way to change this?
4. The default dimension size for the model is 256. Can you explain reasons (if any) to manipulate this variable?
5. With other species integration techniques as well as with batch correction, I've noticed that the model usually works best with greater diversity in cell type. Is this true in SATURN as well, or does the basis on the triplet margin loss function allow for a more narrow dataset?
6. Can you elaborate on using multiple seeds for the model? Should it be standard practice to generate the model many times? Can you explain where the stochastic elements are in making the model?

Once again, thank you so much for your help with generating the .pt files and any insight you can offer! SATURN is such a cool tool and I am enjoying learning how to work with it.

Best, Dana McCormack Senior Research Support Associate Feng Lab, MIT Pronouns: they/them

Yanay1 commented 1 year ago

1. Yes, that's exactly it! Unfortunately there's not a reference protein for every gene. The genes are also matched using string matching, so you might be able to correct some mappings by checking which genes are not in the protein embedding file but are in your anndata.
2. It seems the algorithm is fairly stable regardless of the number of macrogenes, which we showed in Supplementary Figure 8 (of the current version on arXiv). We chose 8000 genes as the number of highly variable genes based on the number of available genes for frog, which was around 9500, just so we could still filter out some that weren't highly variable but try to maintain as many as possible, and then just used that number for every other dataset. We didn't do any extra benchmarking of this parameter in the current version on arXiv, so feel free to try different combinations! Naturally, including more genes and more macrogenes might lead to finer cell typing, to a point.
3. The genes are mapped to macrogenes using the scoring initialization function. In the latest version of the code, there is a parameter centroid_score_func that gives a few more options to choose from, but we didn't find a big difference between them. You can modify the code to use any initialization function, including one that maps specific numbers of genes to each macrogene, if you'd like.
4. This (the model dimension) was another hyperparameter that we didn't want to overfit to our benchmarks. It seemed to work well across every dataset we tried, so maybe it's a stable setting (in combination with the number of layers, epochs and learning rates).
5. That's an interesting question! I think it depends on what you mean by diversity: with a large number of very similar, finely defined cell types, any algorithm will struggle, but when there are broader cell types, such as in our tabula or embryogenesis examples, I think the scores are generally better. We were still able to find differences in fine cell types and cell types with low abundance, however.
6. We used multiple seeds primarily for benchmarking. With deep learning based models, like SATURN and scVI, the weight initializations and mini batches are randomized, so this is a source of surprisingly (to me) high variation in final model quality. With the non deep learning based methods, there are still certain components that may have randomness, such as data shuffling or batching/numerical precision errors, that also lead to variation; however, the deep learning based models seemed to have the highest variance.

For users of scVI or other single cell VAE based methods I don't believe it is standard practice to run the model many times, but this might be an interesting question to ask their team. I think in our case, such as in figure 4B, it can be useful to run the model multiple times to build confidence in something like a reannotation, but it's maybe not wholly necessary.
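
The mapping check suggested in the first answer above can be sketched as follows. This is an illustration only: the function name is hypothetical, the inputs are plain lists of gene names (for example the var names of your anndata and the keys of a loaded embedding file), and a case-only difference is just one assumed way in which exact string matching can fail.

```python
def find_unmapped_genes(adata_genes, embedding_genes):
    """Split genes that are absent from the embedding file into two groups.

    Returns (near_misses, missing):
      near_misses: dict of adata gene -> embedding gene differing only by case
      missing: sorted list of genes with no match at all
    """
    embedding_set = set(embedding_genes)
    # Index embedding genes by upper-cased name to spot case-only mismatches.
    by_upper = {g.upper(): g for g in embedding_genes}

    near_misses, missing = {}, []
    for gene in adata_genes:
        if gene in embedding_set:
            continue  # exact match, nothing to fix
        candidate = by_upper.get(gene.upper())
        if candidate is not None:
            near_misses[gene] = candidate
        else:
            missing.append(gene)
    return near_misses, sorted(missing)


near, missing = find_unmapped_genes(
    ["GAPDH", "Actb", "LOC123"], ["GAPDH", "ACTB", "TP53"]
)
print(near)     # {'Actb': 'ACTB'}
print(missing)  # ['LOC123']
```

Genes reported as near misses could be renamed before training; genes in the second list most likely have no reference protein in the embedding file.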

dana-mcc commented 1 year ago

Sorry for the slow response, a thank you email was sitting in my drafts and I completely forgot about it! Your email was very helpful.

I have another question about the package that I'm hoping you can help with. I'm trying to transfer the SATURN UMAP embedding to an anndata object with the original genes. The goal is to be able to plot individual genes to show species differences but with the superior SATURN integration.

For some reason, the indices seem to be getting mixed up along the way (i.e. the cell types are no longer separated in the UMAP space). I've transferred embeddings like this before from an anndata object with a subset of genes to the full anndata object, but maybe there's something different when you convert from the macrogene space?

I re-indexed the datasets prior to transferring the embedding like this (where saturn_adata is the anndata object generated by saturn and has macrogenes for var, and adata is the combined anndata object from the same files submitted to saturn):

barcodes = list(adata.obs.index)
saturn_adata = saturn_adata[barcodes, :].copy()

and confirmed the re-indexing was successful from the below code returning True:

adata_barcodes = list(adata.obs.index)
saturn_adata_barcodes = list(saturn_adata.obs.index)
adata_barcodes == saturn_adata_barcodes

I tried transferring the embedding in a few different ways:

adata.obsm = saturn_adata.obsm
adata.obsm['X_umap'] = saturn_adata.obsm['X_umap']
umap_df = saturn_adata.obsm['X_umap'] converted to a csv and imported into adata

Even weirder, when I tried to plot a few marker genes to troubleshoot, the expression pattern appeared null. If all of the cell types were mixed, then I should be seeing a mix of positive and null expression.

Do you have any ideas about what could be happening or recommendations for a different method? I'm surprised that this isn't transferring easily given that UMAP is a 2D representation that is disconnected from the macrogene space.

Best, Dana McCormack Senior Research Support Associate Feng Lab, MIT Pronouns: they/them
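
One way to make a transfer like this independent of row order is to align the coordinates by barcode explicitly and to fail loudly on duplicated or missing barcodes. The sketch below is not a diagnosis of the problem described above. The function name is hypothetical, and it works on plain Python sequences so that it does not depend on any particular anndata version.

```python
from collections import Counter


def align_by_barcode(source_barcodes, source_coords, target_barcodes):
    """Reorder `source_coords` so that row i belongs to target_barcodes[i]."""
    if len(source_barcodes) != len(source_coords):
        raise ValueError("source_barcodes and source_coords differ in length")

    # Duplicated barcodes make label-based lookups ambiguous, so reject them.
    for name, barcodes in (("source", source_barcodes), ("target", target_barcodes)):
        dupes = [b for b, n in Counter(barcodes).items() if n > 1]
        if dupes:
            raise ValueError(f"duplicated {name} barcodes, e.g. {dupes[:3]}")

    row_of = {b: i for i, b in enumerate(source_barcodes)}
    absent = [b for b in target_barcodes if b not in row_of]
    if absent:
        raise ValueError(
            f"{len(absent)} target barcodes not in source, e.g. {absent[:3]}"
        )

    return [source_coords[row_of[b]] for b in target_barcodes]


coords = align_by_barcode(
    ["c2", "c1", "c3"],
    [(2.0, 2.5), (1.0, 1.5), (3.0, 3.5)],
    ["c1", "c2", "c3"],
)
print(coords)  # [(1.0, 1.5), (2.0, 2.5), (3.0, 3.5)]
```

With anndata, the inputs would presumably be `list(saturn_adata.obs.index)`, the rows of `saturn_adata.obsm['X_umap']`, and `list(adata.obs.index)`; converting the result back to an array before assigning it to `adata.obsm['X_umap']` is assumed here rather than tested.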

Yanay1 commented 1 year ago

Would it be possible to upload a picture of the umaps/the full code snippet? (Can also email them)

dana-mcc commented 1 year ago

Whoops, I didn't realize that this conversation was on the github issues! The responses went to my email so I assumed it was there. What email address is best?

Yanay1 commented 1 year ago

yanay (at) stanford.edu

snap-stanford / SATURN

Precomputed embeddings #19