jb-leger opened 3 years ago
I think it's a good idea. If we were to go this route, I would consider changing _all_partitions
from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.
(The only reason I made it a generator function in gh-13661 is because cramervonmises_2samp
wasn't vectorized, and I didn't want to mess with that at the time.)
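For illustration, an array-returning variant might look like the following sketch (`all_partitions_2d` is a hypothetical name, not the actual SciPy helper); each row holds the indices assigned to the first sample, and the complementary indices follow from a mask or set difference:

```python
import numpy as np
from itertools import combinations
from math import comb

def all_partitions_2d(nx, ny):
    """Hypothetical array-returning variant of _all_partitions: each of
    the C(nx+ny, nx) rows lists the indices of the first sample in one
    partition of range(nx + ny)."""
    n = nx + ny
    flat = np.fromiter(
        (i for c in combinations(range(n), nx) for i in c),
        dtype=np.intp, count=comb(n, nx) * nx)
    return flat.reshape(-1, nx)
```

A vectorized test could then evaluate its statistic on all rows at once instead of looping over a generator.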
On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion), if you're interested. If you go that route, I'd recommend implementing the distribution in a class.
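As a concrete illustration of the recursion idea for a rank statistic, the Mann-Whitney U null distribution satisfies a classic recurrence on the counts of arrangements. The sketch below is a plain memoized version for exposition only, not SciPy's vectorized `_MWU` implementation:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def mwu_count(u, m, n):
    """Number of orderings of m X's and n Y's whose U statistic equals u.
    Removing the largest observation gives the classic recurrence:
    it is either an X (contributing n to U) or a Y."""
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    return mwu_count(u - n, m - 1, n) + mwu_count(u, m, n - 1)

def mwu_pmf(u, m, n):
    """Exact P(U = u) under the null hypothesis."""
    return mwu_count(u, m, n) / comb(m + n, m)
```

Counting by recursion costs O(m * n * m * n) table entries rather than enumerating all C(m+n, m) arrangements, which is why it scales so much better than full enumeration.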
gh-12531 shows a non-vectorized example. See _PageL.
gh-4933 is vectorized as best as I could. See _MWU.
Thanks for thinking of this!
I think it's a good idea. If we were to go this route, I would consider changing _all_partitions from a generator function to a function that returns a 2D array containing all the partitions in one shot. That would encourage future tests to vectorize these sorts of computations, too.
In fact, generating an array with pythran can be very efficient, at little extra cost.
Let's see:
- meth1.py is basically what we are doing now,
- meth2.py is pure Python (but pythran-compatible) code, not compiled with pythran,
- meth3.py is just a symlink to meth2.py, but compiled with pythran.

In [1]: import meth1, meth2, meth3
In [2]: m1 = meth1.get_indices_permutation_test(21, 10)
...: m2 = meth2.get_indices_permutation_test(21, 10)
...: m3 = meth3.get_indices_permutation_test(21, 10)
In [3]: (m1==m2).all()
Out[3]: True
In [4]: (m1==m3).all()
Out[4]: True
In [5]: %timeit meth1.get_indices_permutation_test(21, 10)
4.39 s ± 89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit meth2.get_indices_permutation_test(21, 10)
2.66 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit meth3.get_indices_permutation_test(21, 10)
47.5 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
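The thread does not include the files themselves, but pythran-friendly code typically avoids generators and fancy indexing in favor of plain loops over a preallocated array, plus an export comment. A hypothetical sketch of what such a function could look like (the name and signature match the benchmark above, but the body is an assumption):

```python
import numpy as np
from math import comb

# pythran export get_indices_permutation_test(int, int)
def get_indices_permutation_test(n, k):
    """Hypothetical pythran-compatible sketch: all C(n, k) combinations
    of k indices from range(n), filled row by row with plain loops
    (the style of code pythran compiles well)."""
    m = comb(n, k)
    out = np.empty((m, k), dtype=np.int64)
    idx = list(range(k))                 # current combination
    for row in range(m):
        for j in range(k):
            out[row, j] = idx[j]
        # advance idx to the next combination in lexicographic order
        j = k - 1
        while j >= 0 and idx[j] == n - k + j:
            j -= 1
        if j >= 0:
            idx[j] += 1
            for t in range(j + 1, k):
                idx[t] = idx[t - 1] + 1
    return out
```

Running `pythran` on such a module produces a native extension that is imported under the same module name, which is consistent with the meth3 symlink trick above.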
It is a quick and dirty experiment. If this looks good to you, I will do the same for the permutation code (for the random samples, where the test performed is not exact).
On the other hand, since it's just based on ranks, I think there are much more efficient methods for calculating the exact distribution of the statistic under the null hypothesis (e.g. using recursion),
I don't know whether there is a recursion formula for this distribution. Do you have any results on this?
It is a quick and dirty experiment. If this looks good to you, I will do the same for the permutation code (for the random samples, where the test performed is not exact).
Yeah! That would be great.
I don't know whether there is a recursion formula for this distribution. Do you have any results on this?
I haven't looked. We can leave that to the future. Compilation will be useful for the random permutation tests, so in that case it makes sense to compile both.
Let's wait until gh-13661 is merged, which will need to wait for another stats maintainer. After that, if you submit a PR with these improvements, I can review and merge it.
In that case, I'm going to leave the code in gh-13661 alone (not replace it with the set difference technique) because I think it will speed up review. I didn't write that mask-based code; I just moved it from cramervonmises_2samp, and I think it is easier to review code that hasn't changed.
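For readers unfamiliar with the two approaches being compared, here is a sketch of both ways to recover the complementary indices of a partition (the function names are illustrative, not the actual SciPy code):

```python
import numpy as np

def complement_by_mask(n, firsts):
    """Boolean-mask approach, in the spirit of the existing
    cramervonmises_2samp code (illustrative sketch)."""
    mask = np.ones(n, dtype=bool)
    mask[np.asarray(firsts)] = False
    return np.flatnonzero(mask)

def complement_by_setdiff(n, firsts):
    """The 'set difference technique' alternative (illustrative);
    assume_unique=True is valid because partition indices never repeat."""
    return np.setdiff1d(np.arange(n), firsts, assume_unique=True)
```

Both return the same sorted indices; the difference is purely one of style and performance, which is why leaving the existing code untouched for review is reasonable.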
@mdhaber: An improvement of the Cramér-von Mises test with vectorization. I thought of this vectorization while I was reviewing your PR. This code depends on _all_partitions, so I am submitting a PR to your branch. If you prefer, I can wait for the merge into scipy/master and then propose a PR against scipy (in that case, please close this PR).

Before vectorization:
After vectorization:
With this improvement, maybe the threshold for switching from exact mode to asymptotic mode for method="auto" (currently 10, for the sum of nx and ny) can be adapted.