src-d / seriate

Optimal ordering of elements in a set given their distance matrix.

Input to function seriate #4

Closed DeepaMahm closed 5 years ago

DeepaMahm commented 5 years ago

Hi,

The docstring in seriate.py says: :param dists: Either a condensed pdist-like or a symmetric square distance matrix.

Does that mean a correlation matrix shouldn't be used as input? Should the correlation matrix be converted to a distance matrix?

Guillemdb commented 5 years ago

The fact that correlations can be negative could influence the calculation of the TSP using ortools, but you can do something like seriate(pdist(corr_matrix)) to solve that problem.
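A minimal sketch of the conversion suggested above, on made-up data. Option A is the pdist(corr_matrix) route from this comment. Option B (1 - correlation) is a common alternative that nobody in this thread proposed, so treat it as an assumption. The seriate call is left as a comment so the snippet runs without the package installed.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))          # made-up data: 200 samples, 6 variables
corr = np.corrcoef(data, rowvar=False)    # 6 x 6 correlation matrix, entries in [-1, 1]

# Option A (suggested in the thread): Euclidean distance between rows of corr.
# The result is condensed, with 6 * 5 / 2 = 15 non-negative entries.
dist_a = pdist(corr)

# Option B (assumption, not from the thread): use 1 - corr as a dissimilarity.
dist_b_square = 1.0 - corr
dist_b_square = (dist_b_square + dist_b_square.T) / 2.0  # enforce exact symmetry
np.fill_diagonal(dist_b_square, 0.0)                     # remove floating-point residue
dist_b = squareform(dist_b_square, checks=False)         # square -> condensed

# Either array could then be passed on, e.g. idx = seriate(dist_a)
```

Both results are non-negative, which avoids the negative-value concern raised here. Which of the two gives a more useful ordering depends on the data.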

In the docs, pdist-like refers to using scipy.spatial.distance.pdist to process non-square matrix input into condensed distances before seriation.
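To illustrate the two input shapes named in the docstring, here is a small sketch using three made-up points. It only shows how SciPy's condensed form relates to the square form. It does not call seriate.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up observations in 2-D (one row per observation).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: the upper triangle read row by row, i.e. d01, d02, d12.
condensed = pdist(points)

# Square form: 3 x 3 symmetric matrix with a zero diagonal.
square = squareform(condensed)

print(condensed)  # [ 5. 10.  5.]
print(square.shape)  # (3, 3)
```

For n observations the condensed array has n * (n - 1) / 2 entries, and squareform converts in either direction.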

vmarkovtsev commented 5 years ago

The TSP has no solution when distances are negative: following the corresponding cycle drives the optimal loss to negative infinity.

The triangle inequality does not have to hold, though. So I don't think that the matrix must be positive definite.
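The points in the comments above could be turned into a quick sanity check before seriation. The helper below is hypothetical and not part of seriate. It flags asymmetry, negative entries, and a non-zero diagonal, and it deliberately does not test the triangle inequality.

```python
import numpy as np

def check_distance_matrix(square, atol=1e-8):
    """Hypothetical helper: list the problems found in a square distance matrix."""
    square = np.asarray(square, dtype=float)
    if square.ndim != 2 or square.shape[0] != square.shape[1]:
        return ["not square"]
    problems = []
    if not np.allclose(square, square.T, atol=atol):
        problems.append("not symmetric")
    if (square < 0).any():
        problems.append("negative entries")
    if not np.allclose(np.diag(square), 0.0, atol=atol):
        problems.append("non-zero diagonal")
    return problems

# A raw correlation matrix fails two checks; 1 - corr passes all of them.
corr = np.array([[1.0, -0.5], [-0.5, 1.0]])
print(check_distance_matrix(corr))        # ['negative entries', 'non-zero diagonal']
print(check_distance_matrix(1.0 - corr))  # []
```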

DeepaMahm commented 5 years ago

@Guillemdb I tried the following:

import os
import pickle
import matplotlib.pyplot as plt
from pprint import pprint
from seriate import seriate
from scipy.spatial.distance import pdist

def serialize_data(f_input):
    if os.path.exists(f_input):
        with open(f_input, "rb") as f:
            # prior to seriation
            df = pickle.load(f)
            pprint(df.head())
            input_np = df.values   #np nd array correlation matrix
            dist = pdist(input_np)  #distance matrix

            # matplotlib
            fig, ax = plt.subplots()
            im = ax.imshow(input_np)
            fig.tight_layout()
            plt.show()

            # seriation
            idx = seriate(dist, timeout=50)
            fig1, ax1 = plt.subplots()
            im1 = ax1.imshow(input_np[idx])
            fig1.tight_layout()
            plt.show()

if __name__ == '__main__':
    f_input = "..."  # path to the pickled DataFrame (elided in the original post)
    serialize_data(f_input)

This is the plot of the input data containing the correlation matrix.

This is the plot of the seriated data. I could observe streaks of blue patterns. However, these streaks aren't grouped together. I expect these streaks to be grouped :(

Guillemdb commented 5 years ago

To me your output looks fine. This is probably as grouped as the streaks can be; results like this are normal when working with such big matrices.

DeepaMahm commented 5 years ago

@Guillemdb Many thanks for the response. Shouldn't the diagonal remain unchanged? Before seriation, I could see a yellow pattern. After seriation, the rows are sorted according to the Euclidean distance.

Would it be a good idea to sort the columns as well? Since the diagonal entries of the correlation matrix are expected to exhibit high correlation, I am a bit confused.

DeepaMahm commented 5 years ago

I came across a post on SO that suggests sorting both columns and rows of the correlation matrix.
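A small sketch of the difference, using a made-up 4 x 4 correlation matrix. The ordering idx is hypothetical and stands in for whatever seriate returns. Indexing with input_np[idx], as in the code posted earlier, permutes only the rows, which moves the unit diagonal. Applying the same order to rows and columns with np.ix_ keeps it in place.

```python
import numpy as np

# Made-up correlation matrix: variables 0 and 2 are similar, as are 1 and 3.
corr = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
])
idx = [0, 2, 1, 3]  # hypothetical ordering, standing in for seriate's output

rows_only = corr[idx]            # same operation as input_np[idx]
both = corr[np.ix_(idx, idx)]    # same permutation on rows and columns

print(np.diag(rows_only))  # [1.  0.2 0.2 1. ]
print(np.diag(both))       # [1. 1. 1. 1.]
```

With both axes permuted the matrix stays symmetric, and similar variables end up in adjacent blocks around the diagonal.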