2023-05-31 Official Partner Meeting Agenda

erictsai1208 commented 1 year ago

Moderator: Eric Tsai
Notetaker: Chen Lin

Weekly check-in:

Project Board
check if our current clustering is making sense -> demo of our process

Cluster result:

31.5% Local community? - lower population density, some houses, but we also see community centres, churches, schools, etc.
24.97% Residential area - seems like there are lots of standalone houses nearby
1.47% Work? - lots of tall buildings (offices) and stores packed in one region
25.82% Shopping centres - lots of parking lots, more stores packed in one region
16.23% Travel/major intersection - many highways, intersections and hotels nearby

Question:

Silhouette score does not seem to be useful for determining what is a good clustering. Are there any suggestions for evaluation metrics that we can use to perform hyperparameter tuning?

For fuzzy c-means (FCM) clustering, commonly used validity indices for grid search include:

Partition Coefficient (PC): This is the mean of the squared membership values over all data points and clusters, ranging from 1/c (completely fuzzy) to 1 (crisp), where c is the number of clusters. A higher PC indicates better clustering, with data points having higher membership to their respective clusters and lower membership to other clusters.
Xie-Beni Index: This measures the ratio of the membership-weighted within-cluster distance to the minimum distance between cluster centers. A lower Xie-Beni index indicates better clustering, with well-separated clusters and low within-cluster distances.
Separation Index (SI): This is defined as the ratio of the minimum distance between cluster centers to the maximum cluster radius. A higher SI indicates that clusters are well separated.
Fukuyama-Sugeno Index (FSI): This is an alternative to the Xie-Beni index. Rather than a ratio, it is the difference between a membership-weighted within-cluster compactness term and a between-cluster separation term (the distance of the cluster centers from the overall data mean). A lower FSI indicates better clustering.

These metrics specifically take into account the fuzzy membership assignments in FCM clustering, in addition to the usual notions of within-cluster vs between-cluster distances.
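
As a rough illustration of the first two indices, here is a minimal pure-Python sketch. It assumes a membership matrix `u` where `u[i][k]` is the membership of point `i` in cluster `k` (rows summing to 1), and uses the generalised Xie-Beni form with fuzzifier `m` (the original definition uses `m = 2`). The function names and the toy data are hypothetical and not taken from our pipeline.

```python
def partition_coefficient(u):
    """Mean of squared memberships; 1/c is fully fuzzy, 1.0 is crisp."""
    n = len(u)
    return sum(value * value for row in u for value in row) / n


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(points, centers, u, m=2.0):
    """Membership-weighted compactness divided by n times the minimum
    squared distance between any two cluster centers (lower is better)."""
    n = len(points)
    c = len(centers)
    compactness = sum(
        (u[i][k] ** m) * sq_dist(points[i], centers[k])
        for i in range(n)
        for k in range(c)
    )
    min_separation = min(
        sq_dist(centers[a], centers[b])
        for a in range(c)
        for b in range(a + 1, c)
    )
    return compactness / (n * min_separation)


# Toy example (made-up data): two tight, well-separated groups.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers = [[0, 1], [10, 1]]
crisp_u = [[1, 0], [1, 0], [0, 1], [0, 1]]
fuzzy_u = [[0.5, 0.5]] * 4

pc_crisp = partition_coefficient(crisp_u)
pc_fuzzy = partition_coefficient(fuzzy_u)
xb_crisp = xie_beni(points, centers, crisp_u)
```

In a grid search, these could be computed for each candidate number of clusters and fuzzifier value, preferring a higher PC and a lower Xie-Beni index.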

If we check the clusters empirically using the map, we could be identifying good clusters by chance since we only sample 10 locations for each cluster. Any suggestions for how we can improve the process?
We consider interpreting cluster results using Approach 2 highlighted here.
Discuss issue with potentially not having time to build model for Subway Canada.

CChCheChen commented 1 year ago

Action Item:

Sending the demo cluster store number and cluster label to Sitewise
Sending the list of features used to build the current best model

Meeting notes:

Try the elbow method to find the optimal number of clusters (if this is applicable to FCM)
Find the optimal K value using a "knee point detection algorithm"
Try population density (or similar) and daytime population density ratio for super urban area
daytime pop vs. residence
Consider $\color{red} {\text{within-cluster}}$ over between-cluster distances
Use the test set to verify the modelled clusters and see if the same pattern of clusters is still present
Approved the unsupervised-to-supervised validation approach
No pressure for last model using Subway CAN
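
One possible way to automate the elbow/knee suggestion from the notes above is sketched below. It is a simplified heuristic in the spirit of knee point detection, not the specific algorithm or library discussed in the meeting: both axes are normalised to [0, 1] and the K whose cost lies furthest below the straight line joining the first and last points is picked. The function name `find_knee` and the cost values are hypothetical; in practice the costs might be the FCM objective for each K.

```python
def find_knee(ks, costs):
    """Return the k at the knee of a decreasing cost curve.

    Assumes ks is sorted ascending and costs generally decrease with k.
    """
    if len(ks) != len(costs) or len(ks) < 3:
        raise ValueError("need at least three (k, cost) pairs of equal length")

    cost_min, cost_max = min(costs), max(costs)
    if cost_max == cost_min:
        return ks[0]  # flat curve: no knee to find

    # Normalise both axes to [0, 1] so the result is scale independent.
    xs = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    ys = [(cost - cost_min) / (cost_max - cost_min) for cost in costs]

    best_k, best_gap = ks[0], float("-inf")
    for k, x, y in zip(ks, xs, ys):
        # Height of the chord from the first to the last point at this x.
        chord_y = ys[0] + (ys[-1] - ys[0]) * x
        gap = chord_y - y  # how far the curve sits below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Made-up costs for illustration only.
candidate_ks = [2, 3, 4, 5, 6, 7, 8]
example_costs = [1000, 600, 350, 300, 270, 250, 240]
knee_k = find_knee(candidate_ks, example_costs)
```

The suggested K is only a starting point; it would still need to be checked against the map-based inspection and the feature distributions.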

erictsai1208 commented 1 year ago

Cluster result discussion:

Look at distribution of some features to validate clusters.
NYC, San Francisco and other large cities will generally form their own cluster.

mozhao0331 / Restaurant_Segmentation_Analysis

2023-05-31 Official Partner Meeting Agenda #87