snap-stanford / SATURN

KeyError: 'species_gene_name' #49

Closed: matejasoretic closed this issue 5 months ago

matejasoretic commented 5 months ago

I have encountered the following issue. When I run SATURN with 8000 HVGs and 2000 macro genes on my multi-species dataset, it works. However, when I increased these values to 12000 HVGs and 3000 macro genes, I got the following error:

Traceback (most recent call last):
  File "/path/to/train-saturn.py", line 1064, in <module>
    trainer(args)
  File "/path/to/train-saturn.py", line 575, in trainer
    centroid_weights.append(torch.tensor(species_genes_scores[sgn]))
KeyError: 'axolotl_AMEX60DD000047-TMEM132B'

Earlier in the output I could see:

After loading the anndata axolotl
View of AnnData object with n_obs × n_vars = 4198 × 42647

So there were more than 12000 genes shared between the axolotl object and its corresponding .pt file. Each species in my dataset had more than 12000 genes, the minimum being 15577.

The gene in question had one peptide in the peptide .fa file I used:

AMEX60DD201000047.1; gene_id:AMEX60DD000047_TMEM132B; VSEGCDAIFVNGKEMKSKVDTVVNFTFQHFSAQLEVTVWVPRLPLQLEVSDTELSQVKGWRIPSASNKRPTRDSEDEEDDEKKGKGCSLQYQHAMVRVLTQFVSESSDFGGQLTYMLGSDWQFDITDLVKDFMKVEEPRIARLEAGRILSGREQGITTVQVLSPLSDSILAEKTVTVLDDRVTITDMGVQLVSGLSLSVRTKKANKNILVGTATAYDTLQAHKQ

I checked, and this gene is expressed in some cells in the object; its expression is not 0.

My command was:

python3 /path/to/train-saturn.py --in_data /path/to/all_species_run.csv \
    --in_label_col=cell_type --ref_label_col=cell_type \
    --num_macrogenes=2000 --hv_genes=8000 \
    --centroids_init_path=/path/to/saturn_results//all_species_centroids.pkl \
    --score_adata --ct_map_path=/path/to/cell_type_map.csv \
    --work_dir=/path/to/work_dir/

For the failing run, I used --num_macrogenes=3000 --hv_genes=12000 instead, with everything else unchanged.

What could be causing SATURN to fail when the number of genes is increased?

Yanay1 commented 5 months ago

Try changing your centroids_init_path to a different file. It might be trying to use the one that was generated for a smaller number of genes (thus causing this error).
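To illustrate how a stale cache could produce this kind of KeyError, here is a minimal sketch. It assumes the cached scores behave like a dictionary keyed by species-prefixed gene names, which is what the lookup `species_genes_scores[sgn]` in the traceback suggests; the actual layout of the pickle is not confirmed here. The helper `find_missing_keys` and the gene names are hypothetical and are not part of SATURN.

```python
def find_missing_keys(scores: dict, gene_names: list[str]) -> list[str]:
    """Return the gene names that have no entry in the cached scores."""
    return [name for name in gene_names if name not in scores]


# Toy stand-in for scores cached by an earlier run with fewer genes.
# (Assumption: the real cache maps gene names to per-gene score vectors.)
stale_scores = {
    "axolotl_GENE_A": [0.1, 0.9],
    "axolotl_GENE_B": [0.7, 0.3],
}

# Genes selected by a later run with a larger --hv_genes value
# (hypothetical names for illustration).
current_genes = ["axolotl_GENE_A", "axolotl_GENE_B", "axolotl_GENE_C"]

missing = find_missing_keys(stale_scores, current_genes)
print(missing)  # ['axolotl_GENE_C']

# Indexing the stale cache with a newly selected gene fails in the same
# way as the reported error.
try:
    stale_scores["axolotl_GENE_C"]
except KeyError as err:
    print(f"KeyError: {err}")
```

A check like this, run against the loaded cache before training, would point to the mismatch directly rather than to a single gene that looks valid in the data.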

matejasoretic commented 5 months ago

Yes, that was the issue, thank you! Closing the issue.
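For anyone hitting the same error, one way to avoid reusing a centroids file from a run with different settings is to put the settings in the file name. The sketch below is only an illustration: `centroids_cache_path` and its naming scheme are hypothetical, and the resulting path would still be passed to `--centroids_init_path` by hand.

```python
from pathlib import Path


def centroids_cache_path(work_dir: str, hv_genes: int, num_macrogenes: int) -> Path:
    """Build a centroids file path that encodes the gene settings.

    Hypothetical naming scheme: different settings give different files,
    so a cache from a smaller run is never picked up by a larger one.
    """
    name = f"centroids_hv{hv_genes}_mg{num_macrogenes}.pkl"
    return Path(work_dir) / name


small_run = centroids_cache_path("/path/to/work_dir", 8000, 2000)
large_run = centroids_cache_path("/path/to/work_dir", 12000, 3000)

print(small_run.name)  # centroids_hv8000_mg2000.pkl
print(large_run.name)  # centroids_hv12000_mg3000.pkl
```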