vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License

Question: is there a fast method for `dcor.independence.distance_covariance_test` #30

Open mycarta opened 3 years ago

mycarta commented 3 years ago

With reference to the example in this notebook, this weekend I compared the performance of the MERGESORT method vs. the NAIVE method with a toy dataset of 8 columns x 21 rows:

%%timeit
dc = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.distance_correlation(col1, col2, method='NAIVE'),
        axis=0, arr=data),
    axis=0, arr=data)
>>> 24.3 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

vs:

%%timeit
dc = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.distance_correlation(col1, col2, method='MERGESORT'),
        axis=0, arr=data),
    axis=0, arr=data)
>>> 17.4 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Since I sometimes work with many thousands of rows, and possibly more columns, I wonder if there is a way to similarly improve the speed of the pairwise p-value calculation:

p = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.independence.distance_covariance_test(
            col1, col2, exponent=1.0, num_resamples=2000)[0],
        axis=0, arr=data),
    axis=0, arr=data)
>>> 4.38 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vnmabus commented 3 years ago

Not as of today. The code would need a separate branch to handle that case, but it should be relatively easy to implement (adding a new function in `_hypothesis` that performs the permutation test using the original arrays instead of the distance matrices, and using it when the method is not NAIVE). If you want to try a PR, I could review it.
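Roughly, a user-side workaround along those lines could look like the sketch below (illustrative only, not the `_hypothesis` internals; the function name, the choice of the MERGESORT estimator, and the +1-corrected p-value are assumptions of the sketch):

import numpy as np
import dcor


def fast_permutation_test(x, y, num_resamples=2000, seed=0):
    # Permutation p-value for distance covariance, recomputing the statistic
    # on the original 1-d samples so a fast O(n log n) estimator can be used
    # in every resample (the fast methods require exponent=1, the default).
    rng = np.random.default_rng(seed)
    observed = dcor.distance_covariance(x, y, method='MERGESORT')
    count = 0
    for _ in range(num_resamples):
        y_perm = rng.permutation(y)  # shuffle one sample, keep the other fixed
        count += dcor.distance_covariance(x, y_perm, method='MERGESORT') >= observed
    # conventional +1 correction so the p-value is never exactly zero
    return (count + 1) / (num_resamples + 1)


# quick check on independent toy data
rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
print(fast_permutation_test(x, y))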

BTW, if you have additional CPUs, you can use the 'AVL' method in `distance_correlation` together with the `rowwise` function for an extra boost.
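For example, the pairwise matrix above could be filled like this (a sketch only: an explicit loop that computes each unordered pair once and uses the 'AVL' estimator; the multi-core `rowwise` path is not shown here, see the dcor documentation for its exact signature):

import numpy as np
import dcor

# toy stand-in, same shape as the example above: 21 rows x 8 columns
rng = np.random.default_rng(0)
data = rng.normal(size=(21, 8))

n_cols = data.shape[1]
dc = np.ones((n_cols, n_cols))      # diagonal is 1 for non-constant columns
for i in range(n_cols):
    for j in range(i + 1, n_cols):  # the matrix is symmetric, compute each pair once
        dc[i, j] = dc[j, i] = dcor.distance_correlation(
            data[:, i], data[:, j], method='AVL')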

mycarta commented 3 years ago

I am at capacity until the fall. After the summer, if I have more time as I hope, I can give it a try.

For my current projects, I am going to heavily decimate my array for the time being:

decimated_df = data.copy().sample(frac=0.05, random_state=1)
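For completeness, the subsampled frame can then be fed straight into the earlier pairwise test, e.g. (a sketch, assuming `data` is a pandas DataFrame; `sample` already returns a new object, so the `copy()` is not strictly required):

import numpy as np
import pandas as pd
import dcor

# toy stand-in for the real dataset
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(5000, 8)))

# keep 5% of the rows, then run the same pairwise p-value computation as above
decimated = data.sample(frac=0.05, random_state=1).to_numpy()

p = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.independence.distance_covariance_test(
            col1, col2, exponent=1.0, num_resamples=2000)[0],
        axis=0, arr=decimated),
    axis=0, arr=decimated)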