mdhaber / scipy

Scipy library main repository
http://scipy.org/scipylib/
BSD 3-Clause "New" or "Revised" License
1 stars 5 forks source link

ENH: vectorize cramervonmises_2samp #58

Open jb-leger opened 3 years ago

jb-leger commented 3 years ago

@mdhaber: A improvment of cramer von mises test with vectorization. I think of this vectorization when I was reviewing your PR. This code depends on _all_partitions, therefore I submit a PR to your branch. I can wait the merge to scipy/master and propose a PR on scipy if you prefer (and in this case, please close this PR).

Before vectorization:

In [1]: import numpy as np
   ...: import scipy.stats as stats
   ...: np.random.seed(0)
   ...: a = np.random.uniform(size=10)
   ...: b = np.random.uniform(size=11)

In [2]: stats.cramervonmises_2samp(a, b, method='exact')
Out[2]: CramerVonMisesResult(statistic=0.12236652236652246, pvalue=0.5228767620408488)

In [3]: %timeit stats.cramervonmises_2samp(a, b, method='exact')
4.79 s ± 88.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After vectorization:

In [1]: import numpy as np
   ...: import scipy.stats as stats
   ...: np.random.seed(0)
   ...: a = np.random.uniform(size=10)
   ...: b = np.random.uniform(size=11)

In [2]: stats.cramervonmises_2samp(a, b, method='exact')
Out[2]: CramerVonMisesResult(statistic=0.12236652236652246, pvalue=0.5228767620408488)

In [3]: %timeit stats.cramervonmises_2samp(a, b, method='exact')
1.67 s ± 25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With this improvment, maybe the threshold (now it is 10, for the sum between nx and ny) for switching from exact mode to asymptotic mode for method="auto" can be adapted.

mdhaber commented 3 years ago

I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too. (The only reason I made it a generator function in gh-13661 is because cramervonmises_2samp wasn't vectorized, and I didn't want to mess with that at the time.)

On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion), if you're interested. If you go that route, I'd recommend implementing the distribution in a class. gh-12531 shows a non-vectorized example. See _PageL. gh-4933 is vectorized as best as I could. See _MWU.

Thanks for thinking of this!

jb-leger commented 3 years ago

I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.

In fact, generating a array with pythran can be very efficient, without a huge cost.

Let see:

In [1]: import meth1, meth2, meth3                                                                               

In [2]: m1 = meth1.get_indices_permutation_test(21, 10) 
   ...: m2 = meth2.get_indices_permutation_test(21, 10) 
   ...: m3 = meth3.get_indices_permutation_test(21, 10)                                                          

In [3]: (m1==m2).all()                                                                                           
Out[3]: True

In [4]: (m1==m3).all()                                                                                           
Out[4]: True

In [5]: %timeit meth1.get_indices_permutation_test(21, 10)                                                       
4.39 s ± 89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit meth2.get_indices_permutation_test(21, 10)                                                       
2.66 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit meth3.get_indices_permutation_test(21, 10)                                                       
47.5 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is quick a dirty experiment. If it is good for you, I will do the same for permutation (for the random samples where the performed test is not exact).

jb-leger commented 3 years ago

On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion),

I don't know if we have recursion formula for this distribution. Do you have some results for this?

mdhaber commented 3 years ago

It is quick a dirty experiment. If it is good for you, I will do the same for permutation (for the random samples where the performed test is not exact).

Yeah! That would be great.

I don't know if we have recursion formula for this distribution. Do you have some results for this?

I haven't looked. We can leave that to the future. Compilation will be useful for the random permutation tests, so in that case it makes sense to compile both.

Let's wait until gh-13661 is merged, which will need to wait for another stats maintainer. After that, if you submit a PR with these improvements, I can review and merge it.

In that case, I'm going to leave the code alone in gh-13661 - not replace with the set difference technique - because I think it will speed up review. I didn't write that mask-based code; I just moved it from cramervonmises_2samp, and I think it is easier to review the code if it hasn't changed.

mdhaber commented 3 years ago

Looks like there is some information about generating the CVM distribution efficiently here.