snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

file not found #4

Closed by killerz99 10 months ago

killerz99 commented 10 months ago

python eval_single_anndata.py

FileNotFoundError: [Errno 2] No such file or directory: '/dfs/project/cross-species/yanay/data/proteome/embeddings/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt'

fix?

In gene_embeddings.py, replace

FZ_EMBEDDING_DIR = Path('/dfs/project/cross-species/yanay/data/proteome/embeddings')

with

FZ_EMBEDDING_DIR = Path('model_files/protein_embeddings')
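
Note that 'model_files/protein_embeddings' is a relative path, so Python resolves it against the directory the script is launched from. A minimal sketch of anchoring it to a known base directory instead; `resolve_embedding_dir` and its `base_dir` argument are hypothetical names for illustration and are not part of the UCE code:

```python
from pathlib import Path


def resolve_embedding_dir(base_dir):
    """Return an absolute path to model_files/protein_embeddings under base_dir.

    Hypothetical helper: base_dir is assumed to be the root of the UCE checkout.
    """
    return (Path(base_dir) / "model_files" / "protein_embeddings").resolve()


if __name__ == "__main__":
    # Assumes the script is run from the root of the repository.
    FZ_EMBEDDING_DIR = resolve_embedding_dir(Path.cwd())
    print(FZ_EMBEDDING_DIR, FZ_EMBEDDING_DIR.is_dir())
```

If the second value printed is False, the model files have probably not been downloaded into that location yet.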

Yanay1 commented 10 months ago

Thanks for finding this issue!

While we work on a fix I think that change should work?

killerz99 commented 10 months ago

the run completed:

Loaded model: ./model_files/4layer_model.torch
100%|████████████████████████████████████████| 480/480 [10:48<00:00, 1.35s/it]
Wrote Anndata to: ./10k_pbmcs_proc_uce_adata.h5ad

Yanay1 commented 10 months ago

I just updated the file (in the main branch) so if you pull again I think this issue should be fixed?

killerz99 commented 10 months ago

thanks!

Yanay1 commented 10 months ago

Can I ask what GPU you are using? Thanks!

killerz99 commented 10 months ago

lol, I'm using an M1 Max MacBook.

I got a weird looking umap for the UCE embeddings on the pbmc example. Did I miss a step?

[Image: embedding_UCE]

Yanay1 commented 10 months ago

Great that it works on Macbook!

Are you plotting embeddings using sc.pl.embedding?

If so: The UCE embedding is 1280 dimensional, so you can't use it directly for visualization.

Instead, you can do the following to generate a umap:

sc.pp.neighbors(adata, use_rep="X_uce")
sc.tl.umap(adata)
sc.pl.umap(adata, ...)

If you don't want to calculate neighbors using the full 1280-dimensional embedding, you can create a new AnnData object that uses the X_uce slot as its .X, and then run PCA, followed by neighbors and UMAP or t-SNE.

That would be doing something like:

new_adata = sc.AnnData(adata.obsm["X_uce"])
sc.pp.pca(new_adata)
sc.pp.neighbors(new_adata)
sc.tl.umap(new_adata)
adata.obsm["X_umap"] = new_adata.obsm["X_umap"]
killerz99 commented 10 months ago

Yeah, the runtime was good (about 15-20min for the test set), it uses ~95% of the Mac GPU.

Thanks for the help. It turns out that a PCA of the embeddings seems more interpretable than a UMAP of the PCA embeddings. The UMAPs look pretty similar between the two methods.

[Images: PCA of embeddings; UMAP method 1; UMAP method 2]

Yanay1 commented 10 months ago

So the UMAPs there look like PCA plots, so I think it's possible that there might be an issue?

killerz99 commented 10 months ago

hmm.. this code alone will generate the pca plot

import scanpy as sc
adata = sc.read_h5ad('10k_pbmcs_proc_uce_adata.h5ad')
new_adata = sc.AnnData(adata.obsm["X_uce"])
sc.pp.pca(new_adata)
adata.obsm["X_pca"] = new_adata.obsm["X_pca"]
sc.pl.embedding(adata, basis='pca', color='cell_type')
Yanay1 commented 10 months ago

How did you generate the UMAP plots?

killerz99 commented 10 months ago
import scanpy as sc
adata = sc.read_h5ad('10k_pbmcs_proc_uce_adata.h5ad')
new_adata = sc.AnnData(adata.obsm["X_uce"])
sc.pp.pca(new_adata)
adata.obsm["X_pca"] = new_adata.obsm["X_pca"]
sc.pl.embedding(adata, basis='pca', color='cell_type')
sc.pp.neighbors(new_adata)
sc.tl.umap(new_adata)
adata.obsm["X_umap"] = new_adata.obsm["X_umap"]
sc.pl.embedding(adata, basis='umap', color='cell_type')
Yanay1 commented 10 months ago

What happens if you do:

sc.pl.umap(new_adata)
killerz99 commented 10 months ago

sorry, it's my fault.. I'm using two different Python environments.. if you aren't in the UCE environment you get the messed-up UMAP.. otherwise it looks fine..

[Image: uce]
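
Since the cause here was a mix-up between two Python environments, it can help to print which interpreter and package versions a session is actually using. A minimal sketch using only the standard library; `describe_environment` is a hypothetical helper, and the package names passed to it are just examples:

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Return the interpreter path and installed version (or None) per package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": sys.executable, "packages": versions}


if __name__ == "__main__":
    info = describe_environment(["scanpy", "anndata", "umap-learn"])
    print(info["python"])
    for name, version in info["packages"].items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both environments and comparing the output should show which packages differ between them.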

killerz99 commented 10 months ago

method1 uce umap

sc.pp.neighbors(adata, use_rep="X_uce")
sc.tl.umap(adata)
sc.pl.umap(adata)

[Image: method1_uce]

Yanay1 commented 10 months ago

Awesome thanks!

In case you're interested, it would be interesting to see if the Mac can run the 33 layer model as well!

killerz99 commented 10 months ago

yeah, I can try it, which line should I modify? I also wanted to generate clusters without using the cell labels. So, just cluster the uce embeddings?

Yanay1 commented 10 months ago

For generating clusters, you should be able to use the default scanpy functions, such as leiden; just make sure to use the X_uce space where needed. sc.tl.leiden is neighbors-based, so it shouldn't matter there. However, some functions, such as sc.tl.dendrogram, take a use_rep argument, which should be set to X_uce; otherwise you might end up using the gene expression space.

For the 33 layer model it seems it's not on the Figshare yet so I can get back to you on that.

Yanay1 commented 10 months ago

The 33 layer model is now uploaded here: https://figshare.com/articles/dataset/Universal_Cell_Embedding_Model_Files/24320806?file=43423236

So you would need to download it into, and change the model_loc and nlayers arguments to eval_single_anndata.py

Abhishaike commented 10 months ago

Still getting the same error:

Using sample 4 layer model
Proccessing m7_central_retina_adjusted
Traceback (most recent call last):

/home/abhishaikemahajan/abhi_experiments/uce/UCE/eval_single_anndata.py:155 in <module>

  152
  153     args = parser.parse_args()
  154     accelerator = Accelerator(project_dir=args.dir)
❱ 155     main(args, accelerator)
  156

/home/abhishaikemahajan/abhi_experiments/uce/UCE/eval_single_anndata.py:83 in main

   80
   81 def main(args, accelerator):
   82     processor = AnndataProcessor(args, accelerator)
❱  83     processor.preprocess_anndata()
   84     processor.generate_idxs()
   85     processor.run_evaluation()
   86

/home/abhishaikemahajan/abhi_experiments/uce/UCE/evaluate.py:94 in preprocess_anndata

   91     def preprocess_anndata(self):
...

FileNotFoundError: [Errno 2] No such file or directory: 'model_files/protein_embeddings/Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt'

Yanay1 commented 10 months ago

What files do you have in the protein embeddings directory in model_files?
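
One quick way to answer this is to list the .pt files in that directory and compare them against the file named in the error. A minimal sketch; `list_embedding_files` is a hypothetical helper, and the directory and file name are the ones from the error message above:

```python
from pathlib import Path


def list_embedding_files(embedding_dir):
    """Return the sorted names of the .pt files in embedding_dir (empty if missing)."""
    directory = Path(embedding_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.pt"))


if __name__ == "__main__":
    expected = "Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt"
    found = list_embedding_files("model_files/protein_embeddings")
    print("\n".join(found) or "no .pt files found")
    print("expected file present:", expected in found)
```

The path is relative, so the script should be run from the same directory as eval_single_anndata.py.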

killerz99 commented 10 months ago

Thanks. I found that you need to transfer over the neighborhood graph from the embedding space to the original adata before you do the clustering, else the obs['leiden'] will not be correctly formatted.

[Image: leiden_transfer]

adata.uns['neighbors'] = new_adata.uns['neighbors']
adata.obsp['distances'] = new_adata.obsp['distances']
adata.obsp['connectivities'] = new_adata.obsp['connectivities']
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data')
killerz99 commented 10 months ago

33 layer takes about an hour on the test set

python eval_single_anndata.py --model_loc model_files/33l_8ep_1024t_1280.torch --nlayers 33
Using sample AnnData: 10k pbmcs dataset
Proccessing 10k_pbmcs_proc
8029.0
10k_pbmcs_proc (11990, 10809)
Wrote Shapes Dict
10809
Max Code: 613
Loaded model: model_files/33l_8ep_1024t_1280.torch
100%|█████████████| 480/480 [1:11:10<00:00, 8.90s/it]
Wrote Anndata to: ./10k_pbmcs_proc_uce_adata.h5ad

[Images: 33layer, 33_CT]
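
For a rough comparison with the 4 layer run reported earlier in this thread (480 batches in 10:48), the elapsed times from the two progress bars can be converted to seconds. A small worked calculation; `to_seconds` is a helper written for this note, and the figures are the ones from the two logs:

```python
def to_seconds(elapsed):
    """Convert a tqdm-style elapsed string (M:SS or H:MM:SS) to seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


batches = 480
four_layer = to_seconds("10:48")      # 4 layer run
thirty_three = to_seconds("1:11:10")  # 33 layer run

print(four_layer / batches)       # seconds per batch, 4 layer
print(thirty_three / batches)     # seconds per batch, 33 layer
print(thirty_three / four_layer)  # slowdown factor
```

On these numbers the 33 layer model is roughly 6.6 times slower than the 4 layer model on the same machine and dataset.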

Yanay1 commented 10 months ago

Great, thanks for the update!

Abhishaike commented 9 months ago

What files do you have in the protein embeddings directory

[Screenshot: Screen Shot 2023-12-04 at 8 34 45 AM]

This is the location of my output_dir

Abhishaike commented 9 months ago

not providing any output_dir at all fixed the problem

Yanay1 commented 9 months ago

I think the issue might have been an extra argument in the file download, which has now been removed (thanks to @bunnech!).