Because of issue #2, we had mostly been using high values for strata_size in tsp/utils/dst_matrx_helpers.py. Once that was fixed, the value actually used was the default of 5. This led to a much less uniform distribution of distances, which the genetic algorithm struggled with.
Here's the distribution graph for strata_size=100:
Here's the distribution graph for strata_size=5 (the default):
ChatGPT had the following insights on how to measure uniformity:
1. Coefficient of Variation (CV)
The CV is the ratio of the standard deviation to the mean. For distance matrices:
A low CV indicates that most distances are close to the mean (uniformity).
A high CV indicates a wide spread of distances (non-uniformity).
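A minimal sketch of how the CV could be computed, assuming the distance matrix is a symmetric list of lists with zeros on the diagonal. The helper names and the two 4-city matrices are hypothetical and are not taken from dst_matrx_helpers.py:

```python
import statistics


def upper_triangle(matrix):
    """Return the distances above the diagonal of a symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def coefficient_of_variation(matrix):
    """Population standard deviation of the pairwise distances divided by their mean."""
    distances = upper_triangle(matrix)
    mean = statistics.fmean(distances)
    if mean == 0:
        raise ValueError("CV is undefined when the mean distance is zero")
    return statistics.pstdev(distances) / mean


# Hypothetical 4-city matrices, for illustration only.
uniform = [[0, 5, 5, 5], [5, 0, 5, 5], [5, 5, 0, 5], [5, 5, 5, 0]]
spread = [[0, 1, 9, 1], [1, 0, 1, 9], [9, 1, 0, 1], [1, 9, 1, 0]]

cv_uniform = coefficient_of_variation(uniform)
cv_spread = coefficient_of_variation(spread)
```

Only the upper triangle is used so that each pair of cities is counted once and the zero diagonal does not drag the mean down.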
2. Skewness and Kurtosis
Skewness measures the asymmetry of the distance distribution, while kurtosis measures the 'tailedness':
Skewness close to 0 suggests a symmetrical distribution of distances.
High kurtosis suggests heavy tails, indicating more extreme values.
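As a rough sketch, both statistics can be computed from the central moments of the pairwise distances. This version uses population moments and reports excess kurtosis (a normal distribution scores 0); that convention is an assumption here, and library implementations may apply bias corrections that give slightly different numbers. The function name and sample values are hypothetical:

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical lists of pairwise distances, for illustration only.
symmetric_skew, symmetric_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])
right_skew, right_kurt = skewness_and_excess_kurtosis([1, 1, 1, 1, 10])
```

In practice the input would be the off-diagonal distances of the matrix, flattened into a single list.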
3. Gini Coefficient
Traditionally used in economics to measure income distribution, the Gini coefficient can also measure the inequality among values in a frequency distribution:
A Gini coefficient close to 0 suggests very uniform distances.
A Gini coefficient close to 1 suggests high inequality among distances.
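A minimal sketch of the Gini coefficient for a list of non-negative distances, using the sorted-rank formula G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n with ranks i starting at 1. The function name and sample data are hypothetical:

```python
def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("Gini coefficient needs at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical lists of pairwise distances, for illustration only.
gini_equal = gini([5, 5, 5, 5])
gini_unequal = gini([0, 0, 0, 10])
```

With a finite sample the maximum is (n - 1) / n rather than exactly 1, so values from matrices of different sizes are only roughly comparable.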
4. Entropy
Entropy measures the randomness in the distribution:
High entropy indicates a more uniform and unpredictable distribution.
Low entropy indicates some predictability, suggesting non-uniformity.
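One way to sketch this is to bin the distances into a histogram and take the Shannon entropy of the bin frequencies, divided by log(bins) so the result lies between 0 and 1. The number of bins is an assumption and affects the result; the function name and sample data are hypothetical:

```python
import math


def normalized_entropy(values, bins=10):
    """Shannon entropy of an equal-width histogram, scaled to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0  # every value falls into a single bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c)
    return entropy / math.log(bins)


# Hypothetical lists of pairwise distances, for illustration only.
entropy_even = normalized_entropy([0, 1, 2, 3], bins=4)
entropy_peaked = normalized_entropy([0, 0, 0, 0, 0, 0, 0, 3], bins=4)
```

Note that entropy reads "uniform" as "spread evenly over the range", which is not the same as the CV reading of "close to the mean": a matrix where all distances are nearly equal has a low CV but also a low histogram entropy.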
5. Clustering Coefficient
For a graph represented by the distance matrix, the clustering coefficient measures the degree to which nodes tend to cluster together:
A high clustering coefficient suggests a problem with clustered cities.
A low clustering coefficient suggests uniformity.
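A distance matrix describes a complete graph, so a clustering coefficient only becomes informative after choosing which edges count. The sketch below assumes a symmetric matrix and a hypothetical distance threshold: two cities are neighbours if their distance is at most the threshold, and nodes with fewer than two neighbours contribute 0 to the average. The function name and sample matrices are illustrative only:

```python
from itertools import combinations


def average_clustering(matrix, threshold):
    """Average local clustering coefficient of the graph with edges d <= threshold."""
    n = len(matrix)
    neighbours = [
        {j for j in range(n) if j != i and matrix[i][j] <= threshold}
        for i in range(n)
    ]
    total = 0.0
    for i in range(n):
        k = len(neighbours[i])
        if k < 2:
            continue  # no neighbour pairs, so this node contributes 0
        links = sum(1 for a, b in combinations(neighbours[i], 2) if b in neighbours[a])
        total += links / (k * (k - 1) / 2)
    return total / n


# Two tight groups of three cities: distance 1 inside a group, 10 between groups.
clustered = [[0 if i == j else (1 if i // 3 == j // 3 else 10) for j in range(6)]
             for i in range(6)]
# One hub city close to everyone, all other cities far from each other.
star = [[0 if i == j else (1 if 0 in (i, j) else 10) for j in range(4)]
        for i in range(4)]

cc_clustered = average_clustering(clustered, threshold=2)
cc_star = average_clustering(star, threshold=2)
```

The result depends strongly on the threshold, so comparing two matrices only makes sense with the threshold chosen the same way for both, for example as a fixed percentile of the distances.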
6. Triangle Inequality Violations
Count how many times the triangle inequality is violated in the matrix:
Few violations suggest a more metric-like space, potentially more uniform.
Many violations suggest a complex space, less uniform.
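A minimal sketch of the count, assuming a symmetric matrix of non-negative distances. For each unordered triple of cities it checks whether the longest side exceeds the sum of the other two, with a small tolerance for floating-point noise. The function name, the tolerance, and the sample matrices are hypothetical:

```python
from itertools import combinations


def count_triangle_violations(matrix, tol=1e-9):
    """Count city triples whose longest side exceeds the sum of the other two."""
    violations = 0
    for a, b, c in combinations(range(len(matrix)), 3):
        ab, bc, ac = matrix[a][b], matrix[b][c], matrix[a][c]
        longest = max(ab, bc, ac)
        if longest > (ab + bc + ac - longest) + tol:
            violations += 1
    return violations


# Points on a line at 0, 1 and 3: a valid metric, so no violations.
metric = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
# Going 0 -> 1 -> 2 costs 2, but the direct distance 0 -> 2 is 5.
non_metric = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]

violations_metric = count_triangle_violations(metric)
violations_non_metric = count_triangle_violations(non_metric)
```

The loop visits every triple, so the cost grows with the cube of the number of cities; for large matrices, sampling random triples would give an estimate of the violation rate instead.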
It also provided some code to test for uniformity, but testing gave a value of 0.57 for strata_size=5 and 0.59 for strata_size=100, so that measure barely distinguishes the two cases. I'll work on improvements, as it's essential to know how uniform the matrix is that the algorithm(s) are trying to solve.