vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License
145 stars 26 forks source link

Improve performance of pairwise distances computation #40

Open vnmabus opened 1 year ago

vnmabus commented 1 year ago

The computation of pairwise distances is the main bottleneck of the naive algorithm for distance covariance. Currently we use scipy's cdist for Numpy arrays, and a broadcasting computation in other case.

Any performance improvement to this function is thus well received.

Palash123-4 commented 14 hours ago

Using the 'dcor. rowwise' function, I found a way to compute the pairwise distances. The idea is to calculate the distances using the 'dcor. rowwise' function first and later allocate them in the symmetric distance matrix form.

I did implement it on the multivariate normal data. The Python implementation is as follows:

import numpy as np
from dcor import u_distance_correlation_sqr, rowwise
from scipy.spatial.distance import pdist, squareform
from scipy.stats import multivariate_normal
def fast_dcor( u, v):
    value = u_distance_correlation_sqr(u.astype('float'), v.astype('float'), method = 'avl')
    return value  
def main():
   np.bool = np.bool_
   matrix_size = 6
   mean_vector =  [2, 3, 5, 3, 2, 1]
    # np.random.seed(123)
    # A = 0.1 * np.random.rand(matrix_size, matrix_size)
   A = np.array([[0.4959073 , 0.1000873 , 0.09526673, 0.32539095, 0.14799651,
            0.14689728],
           [0.08218122, 0.1318435 , 0.38105886, 0.36172172, 0.36502285,
            0.24346033],
           [0.12722501, 0.30467872, 0.34359662, 0.10908805, 0.45987096,
            0.1476644 ],
           [0.38427917, 0.2076916 , 0.15757227, 0.32931607, 0.29685373,
            0.04350785],
           [0.10871352, 0.27666418, 0.25179949, 0.49252174, 0.22908678,
            0.16522994],
           [0.28722805, 0.1650361 , 0.17008386, 0.10228245, 0.01083518,
            0.43123006]])
   B = np.dot(A, A.transpose())
   n_samples = 1000
   mv = multivariate_normal( mean = mean_vector, cov = B)
   a = mv.rvs(size = n_samples, random_state = 123)
   a_transpose = np.array(a.T)
   distance_ = pdist(a_transpose, metric = fast_dcor)
   print(squareform(distance_))
   # optimized pairwise distance using distance correlation rowise function
   print("\n------Optimized pdist for pairwise distance correlation----------")
   pdist_opt = np.zeros((matrix_size,  matrix_size))
   for i in range(matrix_size):       
       if i < matrix_size - 1:
           a1 = a_transpose[0 : (matrix_size - 1 - i)] 
           a2 = a_transpose[ i + 1 : ]  
           r = rowwise(u_distance_correlation_sqr, a1, a2)           
           for j in range(1, matrix_size - i):
               pdist_opt[j - 1, j + i ] = r[j - 1]
               pdist_opt[j + i , j - 1] = r[j - 1]
               pass
           del a1, a2
           pass
       pass
   print(pdist_opt)
if __name__== '__main__' :
    main()

If you see the outcome I print both the result of scipy pdist and the fast one utilizing rowwise function