DeepaMahm closed 5 years ago
The fact that correlations can be negative could influence the calculation of the TSP using ortools, but you can do something like seriate(pdist(corr_matrix)) to solve that problem.
In the docs, pdist-like refers to using scipy.spatial.distance.pdist to process non-square distance matrix input before seriation.
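For readers unsure what "condensed pdist-like" means, here is a minimal sketch of what scipy.spatial.distance.pdist returns. The random `observations` array is made up for illustration; only pdist and squareform come from SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data for illustration: 5 observations (rows) with 3 features each.
rng = np.random.default_rng(0)
observations = rng.normal(size=(5, 3))

# pdist treats every row as one observation and returns the pairwise
# Euclidean distances (the default metric) as a flat "condensed" vector
# holding the upper triangle, so its length is n * (n - 1) / 2.
condensed = pdist(observations)

# squareform expands the condensed vector into the equivalent
# symmetric square matrix with a zero diagonal.
square = squareform(condensed)

print(condensed.shape, square.shape)
```

Note that when a correlation matrix is passed to pdist, each row of correlations is treated as a feature vector, so the resulting distances are Euclidean distances between rows rather than a direct transform of the correlation values.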
The TSP does not have a solution when the values can be negative: we can keep following the corresponding negative cycle and the loss goes to negative infinity.
The triangle inequality does not have to hold, though. So I don't think that the matrix must be positive definite.
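One possible way to avoid negative values altogether, sketched below, is to map each correlation r to the dissimilarity 1 - r, which lies in [0, 2]. This is an assumption about what a suitable transform might look like, not something the library prescribes; the helper name `correlation_to_distance` and the sample data are hypothetical.

```python
import numpy as np


def correlation_to_distance(corr):
    """Map correlations in [-1, 1] to dissimilarities in [0, 2] (hypothetical helper)."""
    corr = np.asarray(corr, dtype=float)
    dist = 1.0 - corr
    # Average with the transpose so rounding noise cannot break symmetry.
    dist = (dist + dist.T) / 2.0
    # Clip tiny negative values caused by floating point error.
    dist = np.clip(dist, 0.0, None)
    # Every variable is at distance zero from itself.
    np.fill_diagonal(dist, 0.0)
    return dist


# Made-up data: 50 samples of 6 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
corr = np.corrcoef(data, rowvar=False)  # 6 x 6 correlation matrix
dist = correlation_to_distance(corr)
```

The result is a non-negative symmetric square matrix with a zero diagonal. If strongly negative correlations should count as "close" too, 1 - |r| would be an alternative; which one fits depends on the data.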
@Guillemdb I tried the following
import os
import pickle
import matplotlib.pyplot as plt
from pprint import pprint
from seriate import seriate
from scipy.spatial.distance import pdist


def serialize_data(f_input):
    if os.path.exists(f_input):
        with open(f_input, "rb") as f:
            # prior to seriation
            df = pickle.load(f)
        pprint(df.head())
        input_np = df.values  # np nd array correlation matrix
        dist = pdist(input_np)  # distance matrix
        # matplotlib
        fig, ax = plt.subplots()
        im = ax.imshow(input_np)
        fig.tight_layout()
        plt.show()
        # seriation
        idx = seriate(dist, timeout=50)
        fig1, ax1 = plt.subplots()
        im1 = ax1.imshow(input_np[idx])
        fig1.tight_layout()
        plt.show()


if __name__ == '__main__':
    f_input = ...  # input
    serialize_data(f_input)
This is the plot of the input data containing the correlation matrix.
This is the plot of the seriated data. I could observe streaks of blue patterns. However, these streaks aren't grouped together. I expect these streaks to be grouped :(
To me your output looks fine. This is probably as grouped as the rows should be; it is normal to get this kind of result when working with such big matrices.
@Guillemdb Many thanks for the response. Shouldn't the diagonal remain unchanged? Before seriation, I could see a yellow pattern. After seriation, the rows are sorted according to the Euclidean distance.
Would it be a good idea to sort the columns as well? Since the diagonal entries of the correlation matrix are expected to exhibit high correlation, I am a bit confused.
I came across a post on SO that suggests sorting both columns and rows of the correlation matrix.
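To illustrate the row-versus-column question, here is a small sketch with a made-up 3 x 3 correlation matrix and a hypothetical ordering `idx` (standing in for whatever the seriation returns). Indexing with input_np[idx] permutes only the rows, which moves the ones off the diagonal; applying the same permutation to rows and columns keeps them there.

```python
import numpy as np

# Made-up correlation matrix for illustration.
corr = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])

# Hypothetical ordering, standing in for the output of a seriation.
idx = [0, 2, 1]

# Rows only: the matrix is no longer symmetric and the diagonal is scrambled.
rows_only = corr[idx]

# Rows and columns together: still a valid correlation matrix,
# with ones on the diagonal and similar variables placed next to each other.
both = corr[np.ix_(idx, idx)]

print(np.diag(rows_only))
print(np.diag(both))
```

With this ordering, variables 0 and 2 (correlation 0.9) end up adjacent in `both`, so the high values cluster near the diagonal.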
Hi,
In the comments in seriate.py, it is mentioned that
:param dists: Either a condensed pdist-like or a symmetric square distance matrix.
Does that mean a correlation matrix shouldn't be used as input? Should the correlation matrix be converted to a distance matrix?