aopisco opened this issue 5 years ago
Hi @aopisco, could you please share some information about your setup?
```python
import batchglm
print(batchglm.__version__)

import diffxpy
print(diffxpy.__version__)
```
Also, are you using a sparse or a dense AnnData object?
You are already using a z-test, so only one model fit should be necessary. Therefore, if I had to guess, I'd assume that you are using a sparse AnnData object. Sparse input can really slow down the calculations, and since your dataset is not very large it should not be a problem to convert it into a dense array (`tiss.X = tiss.X.toarray()`).
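A minimal sketch of that conversion, with a guard in case `X` is already dense (`tiss` is the AnnData object from this thread):

```python
import scipy.sparse

# Convert the AnnData matrix to a dense array only if it is sparse;
# `tiss` is the user's AnnData object.
if scipy.sparse.issparse(tiss.X):
    tiss.X = tiss.X.toarray()
```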
Besides that, what hardware are you using? Did you read the performance guide and install optimized versions of TensorFlow and NumPy?
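A quick way to check what you have installed (a sketch: `np.show_config()` prints the BLAS/LAPACK backend NumPy was built against, and `tf.test.is_built_with_cuda()` reports whether the TensorFlow build has CUDA support):

```python
import numpy as np
import tensorflow as tf

# An optimized NumPy build typically lists MKL or OpenBLAS here.
np.show_config()

# TensorFlow version and whether it was compiled with CUDA support.
print(tf.__version__)
print(tf.test.is_built_with_cuda())
```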
Hi @aopisco, thanks for reporting the issue! I am about to roll out a new version of the backend (batchglm), by the first week of January at the latest; this will also fix some remaining run-time bottlenecks. Right now, training takes long in some cases because the optimizer hyperparameters are not yet ideal for all scenarios; this will be improved in the new batchglm version. It would be great if you could report the versions and your setup in any case! If you haven't optimized TensorFlow yet, don't do it just yet: it takes a long time in many cases, and I have a feeling that this is a different issue.
@Hoeze Changing to a dense array made a huge difference, thanks for the suggestion. Regarding versions, I'm using:
```python
import batchglm
print(batchglm.__version__)
# v0.4.1+2.g63763e7

import diffxpy
print(diffxpy.__version__)
# v0.4.2+49.g6f4ebc6
```
Now I changed to `test="wilcoxon"` and it gives:
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-60-ed807d591abd> in <module>()
4 test="wilcoxon",
5 # noise_model="nb",
----> 6 sample_description=sample_description
7 )
~/maca-scanpy/diffxpy/diffxpy/testing/base.py in pairwise(data, grouping, as_numeric, test, lazy, gene_names, sample_description, noise_model, pval_correction, size_factors, batch_size, training_strategy, quick_scale, dtype, keep_full_test_objs, **kwargs)
3477 quick_scale=quick_scale,
3478 dtype=dtype,
-> 3479 **kwargs
3480 )
3481 pvals[i, j] = de_test_temp.pval
~/maca-scanpy/diffxpy/diffxpy/testing/base.py in two_sample(data, grouping, as_numeric, test, gene_names, sample_description, noise_model, size_factors, batch_size, training_strategy, quick_scale, dtype, **kwargs)
3275 gene_names=gene_names,
3276 grouping=grouping,
-> 3277 dtype=dtype
3278 )
3279 else:
~/maca-scanpy/diffxpy/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description, dtype)
3095 data=X.astype(dtype),
3096 grouping=grouping,
-> 3097 gene_names=gene_names,
3098 )
3099
~/maca-scanpy/diffxpy/diffxpy/testing/base.py in __init__(self, data, grouping, gene_names)
882
883 self._mean = np.mean(data, axis=0)
--> 884 self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
885 self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
886 q = self.qval
~/maca-scanpy/diffxpy/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
70 y=x1[:, i].flatten(),
71 alternative='two-sided'
---> 72 ).pvalue for i in range(x0.shape[1])
73 ])
74 return pvals
~/maca-scanpy/diffxpy/diffxpy/stats/stats.py in <listcomp>(.0)
70 y=x1[:, i].flatten(),
71 alternative='two-sided'
---> 72 ).pvalue for i in range(x0.shape[1])
73 ])
74 return pvals
~/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
4895 T = tiecorrect(ranked)
4896 if T == 0:
-> 4897 raise ValueError('All numbers are identical in mannwhitneyu')
4898 sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
4899
ValueError: All numbers are identical in mannwhitneyu
```
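The traceback shows where this comes from: scipy's `mannwhitneyu` computes a tie correction `T` and raises `ValueError` when every value in a gene is identical across both groups combined (`T == 0`). A minimal sketch of a workaround, assuming the goal is simply to return NaN for such constant genes instead of aborting (the helper name `safe_mannwhitneyu` is hypothetical, not part of diffxpy):

```python
import numpy as np
from scipy import stats

def safe_mannwhitneyu(x0, x1):
    """Per-gene Mann-Whitney U test that skips constant genes.

    x0, x1: dense arrays of shape (cells, genes) for the two groups.
    Hypothetical helper, not part of the diffxpy API.
    """
    pvals = np.full(x0.shape[1], np.nan)
    for i in range(x0.shape[1]):
        a = np.asarray(x0[:, i]).flatten()
        b = np.asarray(x1[:, i]).flatten()
        # mannwhitneyu raises ValueError if all numbers are identical,
        # so leave the p-value as NaN for genes that are constant
        # across both groups combined.
        if np.all(a == a[0]) and np.all(b == a[0]):
            continue
        pvals[i] = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    return pvals
```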
@aopisco, I haven't forgotten this; I am finishing the new release of batchglm first and will address this in the new release of diffxpy after that.
@aopisco You could now go back to the initial z-test with nb noise; this should run at fast/normal speed with the new optimizers. I will look into the Wilcoxon issue next.
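For reference, a sketch of what that call might look like, based on the `pairwise()` signature visible in the traceback above. This assumes the standard `diffxpy.api` entry point; `adata`, `sample_description`, and the grouping column name are placeholders:

```python
import diffxpy.api as de

# Sketch only: `adata` and `sample_description` stand in for the user's
# own AnnData object and sample table; "cell_type" is a hypothetical
# grouping column. Parameter names follow the pairwise() signature
# shown in the traceback above.
de_results = de.test.pairwise(
    data=adata,
    grouping="cell_type",
    test="z-test",
    noise_model="nb",
    sample_description=sample_description,
)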
@davidsebfischer do you have any plans for speeding up the pairwise test?
Currently I'm trying it with an AnnData object with n_obs × n_vars = 1740 × 5829, but it is taking a really long time.
I'm using the same code as in your notebook: