scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 597 forks source link

Math domain error in rank_genes_groups function #530

Closed outlace closed 5 years ago

outlace commented 5 years ago

I get the error below when trying to run the following:

>>> sc.tl.rank_genes_groups(adata, 'louvain', groups=['5','16','19','30'], reference='0', method='wilcoxon')

C:\Users\myuser\Anaconda3\lib\site-packages\scanpy\tools\_rank_genes_groups.py:298: RuntimeWarning: overflow encountered in long_scalars
  (n_active * m_active * (n_active + m_active + 1) / 12))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-160-dd19114ff660> in <module>
      1 #adata.obs['groups'] = ['group 1'= ['0'], 'group 2'= ['5','16','19','30']]
----> 2 sc.tl.rank_genes_groups(adata, 'louvain', groups=['5','16','19','30'], reference='0', method='wilcoxon') # wilcoxon-rank-sum/mann-whitney u test, the default of Seurat

~\Anaconda3\lib\site-packages\scanpy\tools\_rank_genes_groups.py in rank_genes_groups(adata, groupby, use_raw, groups, reference, n_genes, rankby_abs, key_added, copy, method, corr_method, **kwds)
    296 
    297                 scores = (scores - (n_active * (n_active + m_active + 1) / 2)) / sqrt(
--> 298                     (n_active * m_active * (n_active + m_active + 1) / 12))
    299                 scores[np.isnan(scores)] = 0
    300                 pvals = 2 * stats.distributions.norm.sf(np.abs(scores))

ValueError: math domain error

Here adata is real data from our lab, not the tutorial data. Have been trying to replicate the cluster analysis tutorial. All previous steps work fine. Interestingly, if I remove group '5' from the list of groups it works. Also, this error only happens with the wilcoxon method, not with t-test.

ivirshup commented 5 years ago

This certainly looks strange. Would you mind making a minimal complete example? If you need, here's a useful guide to making one.

falexwolf commented 5 years ago

Sorry, there is a small bug in the wilcoxon method, that might hit sometimes. @a-munoz-rojas, it should be resolved after merging your fix, don't you think so? I'd be happy to move forward as soon as the code-overhead issue around double logarithmization is fixed. Should be very simple. Thank you!

a-munoz-rojas commented 5 years ago

Hi @falexwolf - I think it be resolved with the bug fix, but I'm not entirely sure. I just pushed the changes addressing the code-overhead issue, we can give it a try to see if it fixes it. Otherwise we'd have to take a closer look.

ivirshup commented 5 years ago

@outlace, any chance you've tried this since #519 got merged?

outlace commented 5 years ago

@ivirshup I haven't tried it yet, sorry. But you can close the issue. if I still have trouble when I get back to the analysis I can re-open it

flying-sheep commented 5 years ago

Great, thank you!

ivirshup commented 5 years ago

Oh no worries! I was just wondering if this had been fixed (not sure how to reproduce) since I think #566 is a duplicate issue.

HKanenew commented 5 years ago

Hi guys,

Sorry to re-open the thread but I am also getting the same error as described by the OP above with the latest release of Scanpy (v1.4.3).

As with #566 I have no trouble running the rank_genes_groups using t-test or logreg, the problem only arises when using method='wilcoxon'. For the example provided below adata is using one of our real data sets and I am using custom clusters but I have the same problem when replicating the tutorial workflow. My colleague has had the same experience when trying to work through the tutorial on his system, again using the most recent release of Scanpy.

I've made sure 'log_transformed' was being applied to include the #519 fix provided by a-munoz-rojas in the hope that this might help but no such luck, I get the same error either way. Any ideas?

Here is what I ran:

sc.tl.rank_genes_groups(adata, groupby='Custom_clusters_Leiden', groups=['NKT1','NKT2','NKT17','IL2+ aNKT1','aNKT2 & aNKT17','TNF- aNKT1'], reference='rest', method='wilcoxon', corr_method='benjamini-hochberg', log_transformed=True)
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

And here is the resulting error:

ValueError                                Traceback (most recent call last)
<ipython-input-117-a5ba74ea872c> in <module>
----> 1 sc.tl.rank_genes_groups(adata, groupby='Custom_clusters_Leiden', groups=['NKT1','NKT2','NKT17','IL2+ aNKT1','aNKT2 & aNKT17','TNF- aNKT1'], reference='rest', method='wilcoxon', corr_method='benjamini-hochberg', log_transformed=True)
      2 sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

~\Anaconda3\lib\site-packages\scanpy\tools\_rank_genes_groups.py in rank_genes_groups(adata, groupby, use_raw, groups, reference, n_genes, rankby_abs, key_added, copy, method, corr_method, log_transformed, **kwds)
    367 
    368                 scores[imask, :] = (scores[imask, :] - (ns[imask] * (n_cells + 1) / 2)) / sqrt(
--> 369                     (ns[imask] * (n_cells - ns[imask]) * (n_cells + 1) / 12))
    370                 scores[np.isnan(scores)] = 0
    371                 pvals = 2 * stats.distributions.norm.sf(np.abs(scores[imask,:]))

ValueError: math domain error

P.S I just want to say thank you for all the work on Scanpy, loving it so far!