Closed killerz99 closed 10 months ago
Thanks for finding this issue!
While we work on a fix I think that change should work?
The run completed:
Loaded model: ./model_files/4layer_model.torch
100%|████████████████████████████████████████| 480/480 [10:48<00:00, 1.35s/it]
Wrote Anndata to: ./10k_pbmcs_proc_uce_adata.h5ad
I just updated the file (in the main branch) so if you pull again I think this issue should be fixed?
thanks!
Can I ask what GPU you are using? Thanks!
lol, I'm using a M1 Max Macbook.
I got a weird looking umap for the UCE embeddings on the pbmc example. Did I miss a step?
Great that it works on Macbook!
Are you plotting embeddings using sc.pl.embedding?
If so: The UCE embedding is 1280 dimensional, so you can't use it directly for visualization.
Instead, you can do the following to generate a umap:
sc.pp.neighbors(adata, use_rep="X_uce")
sc.tl.umap(adata)
sc.pl.umap(adata ....
If you don't want to calculate neighbors using the full 1280-dimensional embedding, you can create a new AnnData that uses the X_uce slot (adata.obsm["X_uce"]) as its .X, then run PCA, followed by neighbors and UMAP/t-SNE.
That would be doing something like:
new_adata = sc.AnnData(adata.obsm["X_uce"])
sc.pp.pca(new_adata)
sc.pp.neighbors(new_adata)
sc.tl.umap(new_adata)
adata.obsm["X_umap"] = new_adata.obsm["X_umap"]
Yeah, the runtime was good (about 15-20 min for the test set), and it uses ~95% of the Mac GPU.
Thanks for the help. It turns out that a PCA of the embeddings seems more interpretable than a UMAP of the PCA-reduced embeddings. The UMAPs look pretty similar between the two methods.
[Figures: PCA of embeddings; UMAP (method 1); UMAP (method 2)]
The UMAPs there look like PCA plots, so I think there might be an issue?
hmm.. this code alone will generate the pca plot
import scanpy as sc
adata = sc.read_h5ad('10k_pbmcs_proc_uce_adata.h5ad')
new_adata = sc.AnnData(adata.obsm["X_uce"])
sc.pp.pca(new_adata)
adata.obsm["X_pca"] = new_adata.obsm["X_pca"]
sc.pl.embedding(adata, basis='pca', color='cell_type')
How did you generate the UMAP plots?
import scanpy as sc
adata = sc.read_h5ad('10k_pbmcs_proc_uce_adata.h5ad')
new_adata = sc.AnnData(adata.obsm["X_uce"])
sc.pp.pca(new_adata)
adata.obsm["X_pca"] = new_adata.obsm["X_pca"]
sc.pl.embedding(adata, basis='pca', color='cell_type')
sc.pp.neighbors(new_adata)
sc.tl.umap(new_adata)
adata.obsm["X_umap"] = new_adata.obsm["X_umap"]
sc.pl.embedding(adata, basis='umap', color='cell_type')
What happens if you do:
sc.pl.umap(new_adata)
Sorry, it's my fault. I'm using two different Python environments. If you don't use the UCE environment you get the messed-up UMAP; otherwise it looks fine.
[Figure: UMAP of the UCE embeddings (method 1)]
sc.pp.neighbors(adata, use_rep="X_uce")
sc.tl.umap(adata)
sc.pl.umap(adata)
Awesome thanks!
In case you're interested, it would be interesting to see if the Mac can run the 33 layer model as well!
yeah, I can try it, which line should I modify? I also wanted to generate clusters without using the cell labels. So, just cluster the uce embeddings?
For generating clusters you should be able to use the scanpy default functions, like leiden; just make sure to use the X_uce space where needed. sc.tl.leiden is neighbors-based, so it follows whatever representation the neighbors were computed on and needs no extra argument. However, some functions, like sc.tl.dendrogram, take a use_rep argument, which should be set to X_uce; otherwise you might end up using the gene expression space.
For the 33 layer model it seems it's not on the Figshare yet so I can get back to you on that.
The 33 layer model is now uploaded here: https://figshare.com/articles/dataset/Universal_Cell_Embedding_Model_Files/24320806?file=43423236
So you would need to download it, and then change the model_loc and nlayers arguments to eval_single_anndata.py.
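If it helps, here is a small sketch of assembling that call from Python. The helper name build_eval_command is made up for illustration, and the --model_loc / --nlayers spelling and the model file name are taken from the run posted later in this thread; the download location is an assumption.

```python
import sys
from pathlib import Path


def build_eval_command(model_loc, nlayers, script="eval_single_anndata.py"):
    """Return the argv list for running the evaluation script with a given model."""
    return [
        sys.executable,
        script,
        "--model_loc", str(model_loc),
        "--nlayers", str(nlayers),
    ]


# Assumed location of the downloaded 33-layer weights.
model_path = Path("model_files/33l_8ep_1024t_1280.torch")
cmd = build_eval_command(model_path, 33)

if not model_path.is_file():
    print(f"Model file not found yet: {model_path}")
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) from the repository root once the model file is in place.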
Still getting the same error:
Using sample 4 layer model
Proccessing m7_central_retina_adjusted
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/abhishaikemahajan/abhi_experiments/uce/UCE/eval_single_anndata.py:155 │
│ in <module> │
│ │
│ 152 │ │
│ 153 │ args = parser.parse_args() │
│ 154 │ accelerator = Accelerator(project_dir=args.dir) │
│ ❱ 155 │ main(args, accelerator) │
│ 156 │
│ │
│ /home/abhishaikemahajan/abhi_experiments/uce/UCE/eval_single_anndata.py:83 │
│ in main │
│ │
│ 80 │
│ 81 def main(args, accelerator): │
│ 82 │ processor = AnndataProcessor(args, accelerator) │
│ ❱ 83 │ processor.preprocess_anndata() │
│ 84 │ processor.generate_idxs() │
│ 85 │ processor.run_evaluation() │
│ 86 │
│ │
│ /home/abhishaikemahajan/abhi_experiments/uce/UCE/evaluate.py:94 in │
│ preprocess_anndata │
│ │
│ 91 │ def preprocess_anndata(self): │
...
╰──────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory: 'model_files/protein_embeddings/Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt'
What files do you have in the protein embeddings directory in model_files?
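One quick way to answer that, as a sketch: list the .pt files in the directory and check for the one named in the error. The two helper names are made up for illustration; the directory and file name are copied from the traceback above.

```python
from pathlib import Path


def list_embedding_files(embedding_dir):
    """Return the sorted names of the *.pt files in embedding_dir ([] if it is missing)."""
    directory = Path(embedding_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.pt"))


def has_species_embedding(embedding_dir, filename):
    """True if the given embedding file is present in embedding_dir."""
    return filename in list_embedding_files(embedding_dir)


if __name__ == "__main__":
    emb_dir = "model_files/protein_embeddings"
    wanted = "Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt"
    print(list_embedding_files(emb_dir))
    print(has_species_embedding(emb_dir, wanted))
```

Note that the relative path is resolved against the current working directory, so this should be run from the same directory as eval_single_anndata.py.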
Thanks. I found that you need to transfer over the neighborhood graph from the embedding space to the original adata before you do the clustering, else the obs['leiden'] will not be correctly formatted.
adata.uns['neighbors'] = new_adata.uns['neighbors']
adata.obsp['distances'] = new_adata.obsp['distances']
adata.obsp['connectivities'] = new_adata.obsp['connectivities']
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data')
33 layer takes about an hour on the test set
python eval_single_anndata.py --model_loc model_files/33l_8ep_1024t_1280.torch --nlayers 33
Using sample AnnData: 10k pbmcs dataset
Proccessing 10k_pbmcs_proc
8029.0
10k_pbmcs_proc (11990, 10809)
Wrote Shapes Dict
10809
Max Code: 613
Loaded model: model_files/33l_8ep_1024t_1280.torch
100%|█████████████| 480/480 [1:11:10<00:00, 8.90s/it]
Wrote Anndata to: ./10k_pbmcs_proc_uce_adata.h5ad
Great, thanks for the update!
What files do you have in the protein embeddings directory
This is the location of my output_dir
not providing any output_dir at all fixed the problem
I think the issue might have been an extra argument in the file download which has now been removed (thanks to @bunnech !).
python eval_single_anndata.py
FileNotFoundError: [Errno 2] No such file or directory: '/dfs/project/cross-species/yanay/data/proteome/embeddings/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt'
Fix? In gene_embeddings.py, change
FZ_EMBEDDING_DIR = Path('/dfs/project/cross-species/yanay/data/proteome/embeddings')
to
FZ_EMBEDDING_DIR = Path('model_files/protein_embeddings')
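A possible way to avoid hard-coding the path at all, as a sketch only: read an optional override from the environment and fall back to the repository-relative default. The environment variable name UCE_EMBEDDING_DIR and both helper functions are hypothetical and not part of the UCE code.

```python
import os
from pathlib import Path

# Repository-relative default, as in the suggested fix.
DEFAULT_EMBEDDING_DIR = Path("model_files/protein_embeddings")


def resolve_embedding_dir(env_var="UCE_EMBEDDING_DIR", default=DEFAULT_EMBEDDING_DIR):
    """Return the override directory from the environment if set, else the default."""
    override = os.environ.get(env_var)
    return Path(override) if override else Path(default)


def embedding_path(filename, embedding_dir=None):
    """Return the full path to an embedding file, failing early with a clear message."""
    directory = Path(embedding_dir) if embedding_dir is not None else resolve_embedding_dir()
    path = directory / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected protein embedding at {path}")
    return path


print(resolve_embedding_dir())
```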