Check whether our current clustering makes sense (demo of our process)
Cluster result:
31.5%: Local community? - lower population density, some houses, but we also see community centres, churches, schools, etc.
24.97%: Residential area - seems like there are lots of standalone houses nearby
1.47%: Work? - lots of tall buildings (offices) and stores packed in one region
25.82%: Shopping centres - lots of parking lots, more stores packed in one region
16.23%: Travel/major intersection - many highways, intersections and hotels nearby
Question:
Silhouette score does not seem to be useful for determining what is a good clustering. Are there any suggestions for evaluation metrics that we can use to perform hyperparameter tuning?
For fuzzy c-means (FCM) clustering, commonly used validity metrics for grid search are:
Partition Coefficient (PC): This measures how much the clusters overlap, computed as the mean of the squared membership values. It ranges from 1/c (memberships spread evenly over c clusters) to 1 (every point belongs fully to one cluster), so a higher PC indicates a crisper partition.
Xie-Beni Index (XB): This is the ratio of the membership-weighted within-cluster squared distances (compactness) to the number of points times the minimum squared distance between cluster centers (separation). A lower Xie-Beni index indicates better clustering, with compact, well-separated clusters.
Separation Index (SI): This is defined as the ratio of the minimum distance between cluster centers to the maximum cluster radius. A higher SI indicates that clusters are well separated.
Fukuyama-Sugeno Index (FSI): This is an alternative to the Xie-Beni index. Rather than a ratio, it is the difference between the membership-weighted within-cluster distances and the membership-weighted distances of the cluster centers from the overall mean. A lower FSI indicates better clustering.
These metrics specifically take into account the fuzzy membership assignments in FCM clustering, in addition to the usual notions of within-cluster vs between-cluster distances.
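As a rough illustration of how three of these indices could be computed during a grid search, here is a minimal NumPy sketch. It assumes a membership matrix `u` of shape (clusters, samples) whose columns sum to 1, a fuzzifier `m`, and Euclidean distances; the function names and the toy data are hypothetical and not taken from our pipeline. For the Fukuyama-Sugeno index it assumes the overall mean is the mean of the data (some references use the mean of the cluster centers instead).

```python
import numpy as np

def squared_distances(a, b):
    """Squared Euclidean distances between rows of b and rows of a, shape (len(b), len(a))."""
    diff = b[:, None, :] - a[None, :, :]
    return np.sum(diff ** 2, axis=2)

def partition_coefficient(u):
    """Mean squared membership; ranges from 1/c (fully fuzzy) to 1 (crisp)."""
    return float(np.sum(u ** 2) / u.shape[1])

def xie_beni(x, centers, u, m=2.0):
    """Compactness divided by (n * minimum squared center separation); lower is better."""
    d2 = squared_distances(x, centers)            # (clusters, samples)
    compactness = np.sum((u ** m) * d2)
    center_d2 = squared_distances(centers, centers)
    np.fill_diagonal(center_d2, np.inf)           # ignore a center's distance to itself
    return float(compactness / (x.shape[0] * center_d2.min()))

def fukuyama_sugeno(x, centers, u, m=2.0):
    """Compactness minus separation from the data mean (an assumption); lower is better."""
    d2 = squared_distances(x, centers)
    grand_mean = x.mean(axis=0)
    separation = np.sum((centers - grand_mean) ** 2, axis=1)   # one value per cluster
    return float(np.sum((u ** m) * (d2 - separation[:, None])))

# Toy data (made up): two tight, well-separated groups with crisp memberships.
x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers = np.array([[0.0, 0.5], [10.0, 0.5]])
u = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

print(partition_coefficient(u), xie_beni(x, centers, u), fukuyama_sugeno(x, centers, u))
```

In a grid search, each candidate setting (number of clusters, fuzzifier) would produce its own `centers` and `u`, and the indices would be compared across settings. Note that the partition coefficient tends to fall as the number of clusters grows, so it is probably best read alongside the other indices.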
If we check the clusters empirically using the map, we could be identifying good clusters by chance since we only sample 10 locations for each cluster. Any suggestions for how we can improve the process?
We consider interpreting cluster results using Approach 2 highlighted here.
Discuss the risk of not having enough time to build a model for Subway Canada.
Sending the demo cluster store number and cluster label to Sitewise
Sending the list of features used to build the current best model
Meeting notes:
Try the elbow method to find the optimal number of clusters (if this is applicable to fuzzy c-means); pick the optimal K with a knee-point detection algorithm.
Try population density (or a similar measure) and the ratio of daytime population to resident population, to pick out super-urban areas.
Consider $\color{red} {\text{within-cluster}}$ over between-cluster distances
Use the test set to verify the modelled clusters: check whether the same cluster patterns are still present.
Approved to use the unsupervised-to-supervised validation approach.
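One way the elbow suggestion in the notes above could be tried is sketched below. It uses a simple heuristic (the point farthest from the straight line joining the first and last points of the curve), which is an assumption on our part and not necessarily the specific knee-point detection algorithm mentioned in the meeting. The function name `knee_point` and the score values are made up for illustration; in practice the scores might be the FCM objective value for each K.

```python
import numpy as np

def knee_point(ks, scores):
    """Return the K whose (K, score) point lies farthest from the first-to-last chord."""
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (scores - scores.min()) / (scores.max() - scores.min())

    # Perpendicular distance of every point from the line through the endpoints.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

# Hypothetical scores that drop quickly and then flatten out.
candidate_ks = [2, 3, 4, 5, 6, 7, 8]
example_scores = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0, 19.0]
best_k = knee_point(candidate_ks, example_scores)
print(best_k)
```

The heuristic only makes sense when the curve is monotone and has a visible bend; if the scores decrease almost linearly, the chosen K carries little meaning.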
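The test-set check in the notes above could be sketched as follows: keep the cluster centers fitted on the training data, compute fuzzy memberships for the held-out points with the standard FCM membership formula, and compare how the points distribute across clusters. This is only one possible reading of "the same pattern"; the function names, the fuzzifier `m=2.0`, and the toy arrays are assumptions for illustration.

```python
import numpy as np

def fcm_memberships(x, centers, m=2.0):
    """Fuzzy memberships of points x for fixed centers, shape (clusters, samples)."""
    diff = centers[:, None, :] - x[None, :, :]
    d2 = np.maximum(np.sum(diff ** 2, axis=2), 1e-12)   # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))                      # equals d ** (-2 / (m - 1))
    return inv / inv.sum(axis=0, keepdims=True)

def cluster_shares(u):
    """Share of points whose highest membership falls in each cluster."""
    labels = np.argmax(u, axis=0)
    counts = np.bincount(labels, minlength=u.shape[0])
    return counts / counts.sum()

# Toy example (made up): centers come from the training fit and are held fixed.
trained_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
train_x = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]])
test_x = np.array([[1.0, 1.0], [9.0, 9.0]])

train_shares = cluster_shares(fcm_memberships(train_x, trained_centers))
test_shares = cluster_shares(fcm_memberships(test_x, trained_centers))

# A large gap between the two share vectors would suggest the pattern does not carry over.
max_share_gap = float(np.max(np.abs(train_shares - test_shares)))
print(train_shares, test_shares, max_share_gap)
```

Comparing shares alone is a coarse check; comparing per-cluster feature averages between the training and test sets would be a natural complement.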
Moderator: Eric Tsai Notetaker: Chen Lin
Weekly check-in: