theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License

de.test.pairwise very slow #74

Open aopisco opened 5 years ago

aopisco commented 5 years ago

@davidsebfischer do you have any plans for speeding up the pairwise test?

Currently I'm trying it with an AnnData object with n_obs × n_vars = 1740 × 5829, but it is taking a really long time.

I'm using the same code as in your notebook:

test = de.test.pairwise(
    data=tiss,
    grouping="batch",
    test="z-test",
    noise_model="nb",
    sample_description=sample_description)

Hoeze commented 5 years ago

Hi @aopisco, could you please share some information about your setup?

import batchglm
print(batchglm.__version__)
import diffxpy
print(diffxpy.__version__)

Also, are you using a sparse or a dense AnnData object? You are already using a z-test, so only one model fit should be necessary. If I had to guess, I'd assume that your AnnData object is sparse. This can really slow down calculations; since your dataset is not very large, converting it to a dense array should not be a problem (tiss.X = tiss.X.toarray()).

Besides that, what hardware are you using? Did you read the performance guide / install optimized versions of TensorFlow and NumPy?
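
The dense conversion suggested above can be sketched as follows. This is a minimal illustration, not code from diffxpy: the helper name `to_dense` and the toy matrix are assumptions, and the small matrix stands in for `tiss.X`.

```python
import numpy as np
import scipy.sparse as sp


def to_dense(X):
    """Return X as a dense NumPy array, converting only if it is sparse.

    Hypothetical helper for illustration; not part of diffxpy or AnnData.
    """
    if sp.issparse(X):
        return X.toarray()
    return np.asarray(X)


# Toy count matrix (cells x genes) standing in for tiss.X; values are made up.
counts = sp.csr_matrix(np.array([[0, 3, 0], [1, 0, 0]]))
dense = to_dense(counts)
print(type(dense).__name__, dense.shape)
```

With an AnnData object, the result would be assigned back as in the one-liner above (`tiss.X = to_dense(tiss.X)`). For the 1740 × 5829 matrix in this thread, a dense float64 copy holds about 10.1 million values, roughly 81 MB, so memory should not be the limiting factor.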

davidsebfischer commented 5 years ago

Hi @aopisco, thanks for reporting the issue! I am about to roll out a new version of the backend (batchglm), in the first week of January at the latest; this will also fix some remaining run-time bottlenecks. Right now training takes a long time in some cases because the optimizer hyperparameters are not yet ideal for all scenarios; this will be improved in the new batchglm version. It would be great if you could report the versions and your setup in any case! If you haven't optimized TensorFlow yet, don't do it just yet: it takes a long time in many cases, and I have a feeling that this is a different issue.

aopisco commented 5 years ago

@Hoeze changing to a dense array made a huge difference, thanks for the suggestion. Regarding versions, I'm using:

import batchglm
print(batchglm.__version__)
v0.4.1+2.g63763e7
import diffxpy
print(diffxpy.__version__)
v0.4.2+49.g6f4ebc6

Now that I changed to test="wilcoxon", it gives:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-ed807d591abd> in <module>()
      4     test="wilcoxon",
      5 #     noise_model="nb",
----> 6     sample_description=sample_description
      7 )

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in pairwise(data, grouping, as_numeric, test, lazy, gene_names, sample_description, noise_model, pval_correction, size_factors, batch_size, training_strategy, quick_scale, dtype, keep_full_test_objs, **kwargs)
   3477                     quick_scale=quick_scale,
   3478                     dtype=dtype,
-> 3479                     **kwargs
   3480                 )
   3481                 pvals[i, j] = de_test_temp.pval

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in two_sample(data, grouping, as_numeric, test, gene_names, sample_description, noise_model, size_factors, batch_size, training_strategy, quick_scale, dtype, **kwargs)
   3275             gene_names=gene_names,
   3276             grouping=grouping,
-> 3277             dtype=dtype
   3278         )
   3279     else:

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description, dtype)
   3095         data=X.astype(dtype),
   3096         grouping=grouping,
-> 3097         gene_names=gene_names,
   3098     )
   3099 

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in __init__(self, data, grouping, gene_names)
    882 
    883         self._mean = np.mean(data, axis=0)
--> 884         self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
    885         self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
    886         q = self.qval

~/maca-scanpy/diffxpy/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
     70             y=x1[:, i].flatten(),
     71             alternative='two-sided'
---> 72         ).pvalue for i in range(x0.shape[1])
     73     ])
     74     return pvals

~/maca-scanpy/diffxpy/diffxpy/stats/stats.py in <listcomp>(.0)
     70             y=x1[:, i].flatten(),
     71             alternative='two-sided'
---> 72         ).pvalue for i in range(x0.shape[1])
     73     ])
     74     return pvals

~/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
   4895     T = tiecorrect(ranked)
   4896     if T == 0:
-> 4897         raise ValueError('All numbers are identical in mannwhitneyu')
   4898     sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
   4899 

ValueError: All numbers are identical in mannwhitneyu

davidsebfischer commented 5 years ago

@aopisco, I haven't forgotten this, I am finishing the new release of batchglm first and will address this in the new release of diffxpy after that.

davidsebfischer commented 5 years ago

@aopisco You can now use the initial z-test with nb noise again; this should run at normal speed with the new optimizers. I will look into the issue with wilcoxon next.
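
Until the wilcoxon issue is addressed, one possible workaround for the `ValueError` in the traceback above is to skip genes whose values are identical across both groups of a pair before calling SciPy. The sketch below is an assumption about how that could look, not diffxpy's fix: the helper name `rank_sum_pvals` and the toy data are hypothetical, and newer SciPy releases may handle the all-identical case differently from the version in the traceback.

```python
import numpy as np
from scipy import stats


def rank_sum_pvals(x0, x1):
    """Two-sided Mann-Whitney U p-values per gene (column).

    Genes with no variation across both groups are not passed to SciPy
    and are reported as NaN. Hypothetical helper, not part of diffxpy.
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    both = np.vstack([x0, x1])
    # A gene is "constant" if max - min over all cells of the pair is zero.
    constant = np.ptp(both, axis=0) == 0
    pvals = np.full(both.shape[1], np.nan)
    for i in np.flatnonzero(~constant):
        pvals[i] = stats.mannwhitneyu(
            x0[:, i], x1[:, i], alternative="two-sided"
        ).pvalue
    return pvals, constant


# Toy data (cells x genes); gene 0 is zero in every cell of both groups.
group0 = [[0, 1, 5], [0, 2, 6], [0, 3, 7]]
group1 = [[0, 4, 1], [0, 5, 2], [0, 6, 3]]
pvals, constant = rank_sum_pvals(group0, group1)
print(constant)
```

Because the pairwise test compares every pair of groups, a gene can be constant within one pair even if it varies across the full dataset, which is why the check is done per pair rather than once globally.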