Genetic algorithm struggles with distance matrices that have low uniformity leading to premature convergence

Because of issue #2 we had mostly been using high values for strata_size in tsp/utils/dst_matrx_helpers.py. Once that was fixed the value used was the default of 5. This led to much lower uniformity in the distribution of distances, something that the genetic algorithm struggled with.

Here's the distribution graph for strata_size=100: strata100

Here's the distribution graph for strata_size=5 (the default) strata5

ChatGPT had the following insights on how to measure uniformity:

1. Coefficient of Variation (CV)

The CV is the ratio of the standard deviation to the mean. For distance matrices:

A low CV indicates that most distances are close to the mean (uniformity). A high CV indicates a wide spread of distances (non-uniformity).

2. Skewness and Kurtosis

Skewness measures the asymmetry of the distance distribution, while kurtosis measures the 'tailedness':

Skewness close to 0 suggests a symmetrical distribution of distances. High kurtosis suggests heavy tails, indicating more extreme values.

3. Gini Coefficient

Traditionally used in economics to measure income distribution, the Gini coefficient can also measure the inequality among values in a frequency distribution:

A Gini coefficient close to 0 suggests very uniform distances. A Gini coefficient close to 1 suggests high inequality among distances.

4. Entropy

Entropy measures the randomness in the distribution:

High entropy indicates a more uniform and unpredictable distribution. Low entropy indicates some predictability, suggesting non-uniformity.

5. Clustering Coefficient

For a graph represented by the distance matrix, the clustering coefficient measures the degree to which nodes tend to cluster together:

A high clustering coefficient suggests a problem with clustered cities. A low clustering coefficient suggests uniformity.

6. Triangle Inequality Violations

Count how many times the triangle inequality is violated in the matrix:

Few violations suggest a more metric-like space, potentially more uniform. Many violations suggest a complex space, less uniform.

<<<

It also provided some code to test for uniformity but testing gave a value of 0.57 for strata_size=5 and 0.59 for strata_size=100. I'll work on improvements as it's essential to know this about the matrix that the algorithm(s) are trying to solve.

tekgrrl / coding_problems