scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 597 forks source link

Wilcoxon rank_genes_groups math domain error (Scanpy 1.4.3) #706

Open HKanenew opened 5 years ago

HKanenew commented 5 years ago

Hi guys,

Sorry I am new to github, I just posted this a comment on a closed thread (#530) and wasn't sure if that would be seen or not so I am just posting it up in as a new issue, just in case.

I am having trouble running the rank_gene_groups test using wilcoxon, every time I try it throws a math domain error. I think this might be a duplicate of #566.

As with the OP in #566 I have no trouble running the rank_genes_groups using t-test or logreg, the problem only arises when using method='wilcoxon'. For the example provided below adata is using one of our real data sets and I am using custom clusters but I have the same problem when replicating the tutorial workflow. My colleague has had the same experience when trying to work through the tutorial on his system, again using the most recent release of Scanpy.

I've made sure 'log_transformed' was being applied to include the #519 fix provided by a-munoz-rojas in the hope that this might help but no such luck, I get the same error either way. Any ideas on how to solve this?

Here is what I ran:

sc.tl.rank_genes_groups(adata, groupby='Custom_clusters_Leiden', groups=['NKT1','NKT2','NKT17','IL2+ aNKT1','aNKT2 & aNKT17','TNF- aNKT1'], reference='rest', method='wilcoxon', corr_method='benjamini-hochberg', log_transformed=True)
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

And here is the resulting error:

ValueError                                Traceback (most recent call last)
<ipython-input-117-a5ba74ea872c> in <module>
----> 1 sc.tl.rank_genes_groups(adata, groupby='Custom_clusters_Leiden', groups=['NKT1','NKT2','NKT17','IL2+ aNKT1','aNKT2 & aNKT17','TNF- aNKT1'], reference='rest', method='wilcoxon', corr_method='benjamini-hochberg', log_transformed=True)
      2 sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

~\Anaconda3\lib\site-packages\scanpy\tools\_rank_genes_groups.py in rank_genes_groups(adata, groupby, use_raw, groups, reference, n_genes, rankby_abs, key_added, copy, method, corr_method, log_transformed, **kwds)
    367 
    368                 scores[imask, :] = (scores[imask, :] - (ns[imask] * (n_cells + 1) / 2)) / sqrt(
--> 369                     (ns[imask] * (n_cells - ns[imask]) * (n_cells + 1) / 12))
    370                 scores[np.isnan(scores)] = 0
    371                 pvals = 2 * stats.distributions.norm.sf(np.abs(scores[imask,:]))

ValueError: math domain error

P.S I just want to say thank you for all the work on Scanpy, loving it so far!

Originally posted by @HKanenew in https://github.com/theislab/scanpy/issues/530#issuecomment-505305611

LuckyMD commented 5 years ago

Hi,

I've tried to reproduce this with scanpy 1.4.3+80.g740c557 on the pbmc68k_reduced dataset and it works for me. I did the following:

adata = sc.datasets.pbmc68k_reduced()                                                                                                                                                             
sc.pp.filter_cells(adata, min_counts=10)                                                                                                                                                          
sc.pp.filter_genes(adata, min_cells=5) 
sc.pp.normalize_per_cell(adata) 
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, groupby='bulk_labels', groups=['CD56+ NK', 'Dendritic', 'CD34+'], reference='rest', method='wilcoxon', corr_method='benjamini-hochberg', log_transformed=True)

The sc.tl.rank_genes_groups() call is taken more or less from your example. Could you check that this works for you, and otherwise provide a minimal reproducible example that I could test?

HKanenew commented 5 years ago

Hi LuckyMD,

Thanks for your response. I've been able to reproduce the code that you posted above in a fresh notebook with no issue. I'm not sure why it is working now and wasn't working before for the pbmc68k data set.

image

However I am still having trouble with my own data set, so perhaps it is related how I processed the data instead? I'm not sure.

I've sent you an email link to the h5ad file and ipynb file for MRE.

Thanks a mil!

LuckyMD commented 5 years ago

I'm afraid I have a limited bandwidth at the moment and won't be able to get around to this this week. I would suggest you check whether there are negative values in your dataset and whether there are some genes that are not expressed in your data (gene filtering) in the meantime.