method run_gsea() error : SystemError: CPUDispatcher(<function nb_gsea at 0x7f7477d3b9c0>) returned a result with an exception set

nicolas-zimmermann commented 7 months ago

Describe the bug

Hello everyone, I tried to run the run_gsea() method as follows:

    dc.run_gsea(pdata, genesets, use_raw=False)

pdata, AnnData object: output of dc.get_pseudobulk(), of size n_obs x n_var = 26 x 26185
genesets, pd.DataFrame: long net format with 3 columns, "source", "target" and "weight". There are two gene sets; their names are repeated in the source column, and the target column contains gene symbols.

Doing so, I obtain the following error message:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
File ~/miniforge-pypy3/envs/singlecell/lib/python3.11/site-packages/numba/core/serialize.py:46, in _numba_unpickle(address, bytedata, hashed)
     45 try:
---> 46     obj = _unpickled_memo[key]
     47 except KeyError:

File ~/miniforge-pypy3/envs/singlecell/lib/python3.11/site-packages/numba/core/serialize.py:46, in _numba_unpickle(address, bytedata, hashed)
     45 try:
---> 46     obj = _unpickled_memo[key]
     47 except KeyError:

    [... skipping similar frames: _numba_unpickle at line 46 (4 times)]

File ~/miniforge-pypy3/envs/singlecell/lib/python3.11/site-packages/numba/core/serialize.py:46, in _numba_unpickle(address, bytedata, hashed)
     45 try:
---> 46     obj = _unpickled_memo[key]
     47 except KeyError:

ZeroDivisionError: division by zero

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
Cell In[41], line 1
----> 1 dc.run_gsea(ckd_vs_ref, genesets, use_raw=False)

File ~/miniforge-pypy3/envs/singlecell/lib/python3.11/site-packages/decoupler/method_gsea.py:356, in run_gsea(mat, net, source, target, times, batch_size, min_n, seed, verbose, use_raw)
    353     print('Running gsea on mat with {0} samples and {1} targets for {2} sources.'.format(m.shape[0], len(c), len(net)))
    355 # Run GSEA
--> 356 estimate, norm_e, pvals = gsea(m, net, times=times, seed=seed, verbose=verbose)
    358 # Transform to df
    359 estimate = pd.DataFrame(estimate, index=r, columns=net.index)

File ~/miniforge-pypy3/envs/singlecell/lib/python3.11/site-packages/decoupler/method_gsea.py:199, in gsea(mat, net, times, seed, verbose)
    196         row = mat[i]
    198     # Compute GSEA per row
--> 199     es[i], nes[i], pvals[i], _, _, _, _ = nb_gsea(row, net, starts, offsets, times, seed, False)
    201 if times != 0:
    202     return es, nes, pvals

SystemError: CPUDispatcher(<function nb_gsea at 0x7f955f783920>) returned a result with an exception set

Expected behavior

I do not understand the error; maybe the input I've given to the function is wrong?

System

I'm running this in a mamba environment on Ubuntu 20.04, my CPU is an Intel Xeon E5-2650 v2, and my software versions are:

decoupler 1.5.0
anndata 0.10.2
numba 0.57.1

Thank you in advance for your time.

Best, Nicolas Zimmermann

PauBadiaM commented 7 months ago

Hi @nicolas-zimmermann, you got a ZeroDivisionError. This means that when performing the permutations used to obtain the normalized score, you sampled only zeros. Try increasing the number of permutations, and also check that you have non-zero values in your input. Let me know how it goes.
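For anyone wanting to run the suggested input check, here is a minimal sketch of one way to summarize the zeros in an expression matrix. The helper name `zero_summary` is hypothetical and not part of decoupler; it assumes a dense samples x genes array (for an AnnData object this would be something like `pdata.X`, converted with `.toarray()` first if it is sparse).

```python
import numpy as np

def zero_summary(mat):
    """Summarize zeros in a dense samples x genes matrix (hypothetical helper)."""
    mat = np.asarray(mat, dtype=float)
    return {
        # Fraction of all entries that are exactly zero.
        "zero_fraction": float(np.mean(mat == 0)),
        # Samples (rows) with no non-zero value at all.
        "all_zero_rows": int(np.sum(~mat.any(axis=1))),
        # Genes (columns) with no non-zero value at all.
        "all_zero_cols": int(np.sum(~mat.any(axis=0))),
        # NaN or infinite entries, which can also break downstream scoring.
        "non_finite": int(np.sum(~np.isfinite(mat))),
    }

example = [[0, 0, 0],
           [1, 0, 2]]
summary = zero_summary(example)
print(summary)
```

A large `zero_fraction` or any all-zero sample would be a hint that the permutation step may end up working with zeros only.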

nicolas-zimmermann commented 7 months ago

Hi @PauBadiaM, thank you for the quick response! Increasing the number of permutations (to 10,000, then 100,000) didn't have any effect. I have a few zero values in my input data, which is a pseudobulk obtained with the sum mode:

    pdata = dc.get_pseudobulk(
        adata,
        sample_col='patient',
        groups_col='condition.l1',
        layer='counts',
        mode='sum',
        min_cells=20,
        min_counts=300
    )

What data format does run_gsea expect when given an AnnData object? Best, Nicolas

PauBadiaM commented 7 months ago

Did you normalize your counts after pseudobulking? Maybe try that. It would also be good if you could share a small reproducible example so that I can debug it.
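To illustrate what normalizing after pseudobulking can look like, here is a NumPy sketch of library-size scaling followed by a log1p transform. This is one common choice, not necessarily the exact normalization meant here; the function name `normalize_log1p` and the `target_sum` default are assumptions for illustration. In scanpy the analogous steps are `sc.pp.normalize_total` and `sc.pp.log1p`.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each sample to target_sum total counts, then apply log1p (sketch)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Samples with zero total counts are left as zeros instead of dividing by zero.
    scale = np.divide(target_sum, totals, out=np.zeros_like(totals), where=totals > 0)
    return np.log1p(counts * scale)

raw = [[1, 3],
       [0, 0]]
normalized = normalize_log1p(raw, target_sum=4)
print(normalized)
```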

nicolas-zimmermann commented 7 months ago

Yes, I did normalize it after the pseudobulk. I also ran gsva on the same input and it didn't return any error. For the example, where should I send you the data?

PauBadiaM commented 7 months ago

Another thing you can try is to change the seed parameter, maybe you are being very unlucky with the sampling. If you do not want to share it publicly you can send it to me via email at pau.badia {at} uni-heidelberg.de

PauBadiaM commented 7 months ago

Hi @nicolas-zimmermann ,

Thanks for sharing an example, I've located where the error was coming from. In some cases, if there were a lot of 0s, a division by zero was happening and throwing the error. I've updated the code to handle this in 20764f09f5acfd85ee3bdb8414a30a552f1b53b6. You can update decoupler and try again:

pip install --upgrade git+https://github.com/saezlab/decoupler-py.git

BTW, I saw that you had a remove_absent_genes function. There is no need for it, since decoupler already handles this under the hood. Hope this is helpful! Let me know if it does not work.
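As an illustration of the kind of division by zero described above, here is a plain-Python sketch of a guarded normalized enrichment score. It is not the code from the commit; it assumes the usual GSEA convention of dividing the enrichment score by the mean of the permutation scores that share its sign, and the function name `normalized_enrichment` is hypothetical. When the null scores are all zero, or none share the sign, that mean is zero or undefined; inside a Numba-compiled function with the default error model this raises ZeroDivisionError, as in the traceback.

```python
import math

def normalized_enrichment(es, null_es):
    """Divide es by the mean magnitude of same-sign null scores (sketch).

    Returns NaN instead of raising when the normalization is undefined.
    """
    if es >= 0:
        same_sign = [x for x in null_es if x >= 0]
    else:
        # Use magnitudes so the result keeps the sign of es.
        same_sign = [-x for x in null_es if x < 0]
    if not same_sign:
        return float("nan")  # no permutation shares the sign of es
    mean_null = sum(same_sign) / len(same_sign)
    if mean_null == 0.0:
        return float("nan")  # all same-sign null scores are zero
    return es / mean_null

print(normalized_enrichment(0.5, [0.25, 0.25, -0.1]))
print(normalized_enrichment(0.0, [0.0, 0.0]))
```

Returning NaN is only one possible design choice; a library may instead return zero or skip the gene set.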

saezlab / decoupler-py, issue #88