Open vnmabus opened 1 year ago
Using the 'dcor. rowwise' function, I found a way to compute the pairwise distances. The idea is to calculate the distances using the 'dcor. rowwise' function first and later allocate them in the symmetric distance matrix form.
I did implement it on the multivariate normal data. The Python implementation is as follows:
import numpy as np
from dcor import u_distance_correlation_sqr, rowwise
from scipy.spatial.distance import pdist, squareform
from scipy.stats import multivariate_normal
def fast_dcor( u, v):
value = u_distance_correlation_sqr(u.astype('float'), v.astype('float'), method = 'avl')
return value
def main():
np.bool = np.bool_
matrix_size = 6
mean_vector = [2, 3, 5, 3, 2, 1]
# np.random.seed(123)
# A = 0.1 * np.random.rand(matrix_size, matrix_size)
A = np.array([[0.4959073 , 0.1000873 , 0.09526673, 0.32539095, 0.14799651,
0.14689728],
[0.08218122, 0.1318435 , 0.38105886, 0.36172172, 0.36502285,
0.24346033],
[0.12722501, 0.30467872, 0.34359662, 0.10908805, 0.45987096,
0.1476644 ],
[0.38427917, 0.2076916 , 0.15757227, 0.32931607, 0.29685373,
0.04350785],
[0.10871352, 0.27666418, 0.25179949, 0.49252174, 0.22908678,
0.16522994],
[0.28722805, 0.1650361 , 0.17008386, 0.10228245, 0.01083518,
0.43123006]])
B = np.dot(A, A.transpose())
n_samples = 1000
mv = multivariate_normal( mean = mean_vector, cov = B)
a = mv.rvs(size = n_samples, random_state = 123)
a_transpose = np.array(a.T)
distance_ = pdist(a_transpose, metric = fast_dcor)
print(squareform(distance_))
# optimized pairwise distance using distance correlation rowise function
print("\n------Optimized pdist for pairwise distance correlation----------")
pdist_opt = np.zeros((matrix_size, matrix_size))
for i in range(matrix_size):
if i < matrix_size - 1:
a1 = a_transpose[0 : (matrix_size - 1 - i)]
a2 = a_transpose[ i + 1 : ]
r = rowwise(u_distance_correlation_sqr, a1, a2)
for j in range(1, matrix_size - i):
pdist_opt[j - 1, j + i ] = r[j - 1]
pdist_opt[j + i , j - 1] = r[j - 1]
pass
del a1, a2
pass
pass
print(pdist_opt)
if __name__== '__main__' :
main()
If you see the outcome I print both the result of scipy pdist and the fast one utilizing rowwise function
The computation of pairwise distances is the main bottleneck of the naive algorithm for distance covariance. Currently we use scipy's cdist for Numpy arrays, and a broadcasting computation in other case.
Any performance improvement to this function is thus well received.