snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

probabilities contain NaN in numpy.random.mtrand.RandomState.choice #33

Status: Closed (closed by mcrewcow 2 months ago)

mcrewcow commented 3 months ago

Hi, thank you for developing this package!

I have an issue running the tool on my dataset, which was built mainly in Seurat and later converted to .h5ad with SeuratDisk. At first there was an error about the _index column, which I fixed with:

fetal_total.__dict__['_raw'].__dict__['_var'] = fetal_total.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})

Now, with

python eval_single_anndata.py --adata_path=/mnt/c/Bioinf/HUMAN_FETAL_RETINA/COMBINED_EKPB_v1_clean_int_indexed.h5ad --dir=/mnt/c/Bioinf/HUMAN_FETAL_RETINA/ --species=human

I receive the following error. The full output is:

[2024-03-30 03:33:45,762] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using sample 4 layer model
Proccessing COMBINED_EKPB_v1_clean_int_indexed
10.0
COMBINED_EKPB_v1_clean_int_indexed (113073, 1861)
Wrote Shapes Dict
1861
Max Code: 612
Loaded model:
./model_files/4layer_model.torch
  0%|                                                                                                                                                     | 0/4523 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "eval_single_anndata.py", line 155, in <module>
    main(args, accelerator)
  File "eval_single_anndata.py", line 85, in main
    processor.run_evaluation()
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/evaluate.py", line 146, in run_evaluation
    self.starts_path, shapes_dict, self.accelerator, self.args)
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/evaluate.py", line 235, in run_eval
    for batch in pbar:
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/accelerate/data_loader.py", line 377, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/eval_data.py", line 65, in __getitem__
    dataset_to_starts=self.dataset_to_starts)
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/eval_data.py", line 128, in sample_cell_sentences
    replace=True)
  File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN

Thank you for your help!

Yanay1 commented 2 months ago

This error usually happens when you have cells with no genes expressed. Please double check that the .X slot contains count values, and that all cells have gene expression.
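
To illustrate the point above, here is a minimal sketch of how a cell with no expressed genes can lead to this exact message. It assumes the expression vector is normalized by its sum before being passed to `RandomState.choice`; the helper `sampling_weights` is hypothetical and is not taken from the UCE code.

```python
import numpy as np

def sampling_weights(cell_counts):
    """Hypothetical helper: turn one cell's counts into sampling probabilities."""
    counts = np.asarray(cell_counts, dtype=float)
    # Silence the divide warnings so the NaN result is visible on its own.
    with np.errstate(divide="ignore", invalid="ignore"):
        return counts / counts.sum()

rng = np.random.RandomState(0)

ok_cell = [3, 0, 1, 6]     # a cell with some expressed genes
empty_cell = [0, 0, 0, 0]  # a cell with no expressed genes

good = sampling_weights(ok_cell)
bad = sampling_weights(empty_cell)  # 0 / 0 gives NaN in every position

# Sampling works for the normal cell; gene index 1 has weight 0 and is never drawn.
draws = rng.choice(len(ok_cell), size=5, p=good, replace=True)

# Sampling fails for the empty cell because the weights are all NaN.
try:
    rng.choice(len(empty_cell), size=5, p=bad, replace=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(draws)
print(error_message)
```

Under that assumption, a single all-zero cell in a batch is enough to stop the run.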

mcrewcow commented 2 months ago

Hi Yanay1,

I checked the object, and it does have both .X and .raw.X slots. After the h5seurat to h5ad conversion I can run all the downstream analysis in scanpy, scvelo, scenic+, etc. I have also noticed that if I convert the 'integrated' assay of the Seurat object to .h5ad, I get the error described above. Yet if I convert the 'RNA' assay, I get the following:

python eval_single_anndata.py --adata_path=/mnt/c/Bioinf/HUMAN_FETAL_RETINA/COMBINED_EKPB_v1_clean_RNA_indexed.h5ad --dir=/mnt/c/Bioinf/HUMAN_FETAL_RETINA/ --species=human --multi_gpu=True
[2024-04-01 15:54:41,230] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using sample 4 layer model
Proccessing COMBINED_EKPB_v1_clean_RNA_indexed
Killed

mcrewcow commented 2 months ago

[Image attachment not preserved]

Yanay1 commented 2 months ago

What is the result of

min(np.sum(fetal_int.X, axis=0))

Try deleting all the intermediate files created by UCE and then rerunning.

mcrewcow commented 2 months ago

Unfortunately, deleting the intermediate files did not help.

This is the output of the command: -2559.4630599647617

Yanay1 commented 2 months ago

You cannot have negative numbers in .X. The expression values are used as probability weights. They should be count values.
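
A quick way to test a matrix against these requirements is sketched below. It assumes a dense cells × genes array (a sparse `.X` would need converting first), and the function name `check_counts` is made up for illustration. Note that the check quoted earlier sums along `axis=0`, which for a cells × genes matrix gives per-gene totals; per-cell totals use `axis=1`.

```python
import numpy as np

def check_counts(X):
    """Hypothetical helper: report properties that make a matrix unusable as count weights."""
    X = np.asarray(X, dtype=float)
    per_cell = X.sum(axis=1)  # one total per row, i.e. per cell
    finite = X[np.isfinite(X)]
    return {
        "has_nan": bool(np.isnan(X).any()),
        "has_negative": bool((X < 0).any()),
        "non_integer": bool(np.any(finite % 1 != 0)),
        "n_empty_cells": int((per_cell == 0).sum()),
    }

# Toy matrix: the second cell is empty and the third holds a negative,
# non-integer value, as scaled or integrated data typically does.
X = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
    [4.0, -1.5, 0.0],
])
report = check_counts(X)
print(report)
```

Any `True` flag or a non-zero `n_empty_cells` suggests the matrix is not raw counts, or that some cells need to be removed before running the evaluation.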

mcrewcow commented 2 months ago

Oh, I found that the issue comes from the SeuratDisk conversion: the counts are written to .raw.X. I have now transferred them to .X, and they look like real counts. I have also filtered the genes with min_counts = 40. Still, I keep getting:

python eval_single_anndata.py --adata_path=/mnt/c/Bioinf/HUMAN_FETAL_RETINA/COMBINED_EKPB_v1_clean_countsonly_new_indexed_int_maybe.h5ad --dir=/mnt/c/Bioinf/HUMAN_FETAL_RETINA/ --species=human
[2024-04-01 18:16:22,542] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using sample 4 layer model
Proccessing COMBINED_EKPB_v1_clean_countsonly_new_indexed_int_maybe
7.616953
COMBINED_EKPB_v1_clean_countsonly_new_indexed_int_maybe (113073, 1861)
Wrote Shapes Dict
1861
Max Code: 612
Loaded model:
./model_files/4layer_model.torch
  0%|                                                                                          | 0/4523 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "eval_single_anndata.py", line 155, in <module>
    main(args, accelerator)
  File "eval_single_anndata.py", line 85, in main
    processor.run_evaluation()
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/evaluate.py", line 146, in run_evaluation
    self.starts_path, shapes_dict, self.accelerator, self.args)
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/evaluate.py", line 235, in run_eval
    for batch in pbar:
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/accelerate/data_loader.py", line 377, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/mcrewcow/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/eval_data.py", line 65, in __getitem__
    dataset_to_starts=self.dataset_to_starts)
  File "/mnt/c/Users/rodri/Downloads/UCE-main/UCE-main/eval_data.py", line 128, in sample_cell_sentences
    replace=True)
  File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN

Yanay1 commented 2 months ago

I am not sure what the cause could be other than some cell containing negative numbers, zero counts, or NaNs. If you want, you can email me a copy of the anndata and I can inspect it. My email is (first name) @ stanford.edu.

It seems the issue happens in the first batch.

Thanks!
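
One possibility worth ruling out, given the gene filtering with min_counts = 40 mentioned earlier in the thread: removing genes can leave some cells with zero total counts even when every cell had counts beforehand. The sketch below shows this on a toy dense array. The helper `drop_empty_cells` and the toy threshold are assumptions for illustration, and nothing here confirms that this was the cause for the dataset in question.

```python
import numpy as np

def drop_empty_cells(X, min_counts=1):
    """Hypothetical helper: keep only cells whose total count reaches min_counts."""
    X = np.asarray(X)
    totals = X.sum(axis=1)       # per-cell totals
    keep = totals >= min_counts  # boolean mask over cells
    return X[keep], keep

# Toy cells x genes counts: every cell has at least one count to begin with.
X = np.array([
    [5, 0, 0, 1],
    [0, 0, 3, 0],
    [2, 4, 0, 0],
])

# Gene filter (toy threshold of 4 total counts per gene).
gene_keep = X.sum(axis=0) >= 4
X_genes = X[:, gene_keep]

# The second cell only expressed a gene that was just removed, so it is now empty.
X_clean, cell_keep = drop_empty_cells(X_genes)
print(cell_keep)
print(X_clean)
```

On an AnnData object, `scanpy.pp.filter_cells(adata, min_counts=1)` applied after the gene filter would do the same job.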