Open MattConflitti opened 5 years ago
If you are working with GPS coordinates, wouldn't working with a metric designed for them (haversine) be a starting point to avoid arbitrary assignments?
@hammadmazhar1 Absolutely! Good point. This was more to prove a point, but I would encourage whoever decides to use this method to use the appropriate distance metric.
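For reference, here is a rough sketch of what the haversine variant could look like. It assumes the points are `(latitude, longitude)` pairs in degrees, that noise is labeled `-1`, and that "closest cluster" means the cluster of the nearest non-noise point. The function names `haversine_km` and `reassign_noise_haversine` are made up for illustration and are not part of any library.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Pairwise great-circle distances in km.

    a: (n, 2) and b: (m, 2) arrays of (lat, lon) in degrees -> (n, m) array.
    """
    a = np.radians(np.asarray(a, dtype=float))
    b = np.radians(np.asarray(b, dtype=float))
    lat1, lon1 = a[:, None, 0], a[:, None, 1]
    lat2, lon2 = b[None, :, 0], b[None, :, 1]
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


def reassign_noise_haversine(points, labels):
    """Give each noise point (-1) the label of its nearest clustered point."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    noise = labels == -1
    if not noise.any() or noise.all():
        return labels.copy()  # nothing to reassign, or nothing to assign to
    dists = haversine_km(points[noise], points[~noise])
    out = labels.copy()
    out[noise] = labels[~noise][dists.argmin(axis=1)]
    return out
```

The difference from plain Euclidean distance on raw degrees shows up away from the equator, where a degree of longitude covers much less ground than a degree of latitude.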
I was searching through the issues in this repo for anything on soft clustering, or on making sure each point is assigned to a cluster even if the assignment is suboptimal. I played around with the soft clustering methods described in the documentation, but was getting some wacky results: data points on opposite sides of the plot were placed in the same cluster even though they were much closer to a different cluster (I know this is a known issue). Since my data consists of GPS coordinates, I do not want that behavior.
Depending on your dataset, a potential workaround is to do regular hard clustering and then assign each noise point to the closest cluster after the fact, based on smallest Euclidean distance. This does bypass the linkage tree calculations, but it seems to be more consistent at keeping local points together in a cluster, rather than assigning a point to a cluster that is hundreds of miles away, which wouldn't make sense when clustering GPS coordinates.
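A minimal sketch of the idea, not necessarily the exact code I ran: it assumes noise is labeled `-1` (as in the `labels_` output of the hard clustering) and takes "closest cluster" to mean the cluster of the nearest non-noise point. Using cluster centroids instead would be another reasonable reading. The function name `assign_noise_to_nearest_cluster` is made up for illustration.

```python
import numpy as np


def assign_noise_to_nearest_cluster(points, labels):
    """Return a copy of labels where each noise point (-1) takes the label
    of its nearest clustered point, by Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    noise = labels == -1
    if not noise.any() or noise.all():
        return labels.copy()  # nothing to reassign, or nothing to assign to

    clustered_points = points[~noise]
    clustered_labels = labels[~noise]

    # Brute-force (n_noise, n_clustered) distance matrix. Fine for modest
    # datasets; a spatial index would be a better fit for large ones.
    diffs = points[noise][:, None, :] - clustered_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    out = labels.copy()
    out[noise] = clustered_labels[dists.argmin(axis=1)]
    return out
```

Here `points` is the same array that was passed to the clusterer and `labels` is its hard clustering result, so the output has no `-1` entries left as long as at least one cluster was found.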
This seems to work for my use case, and I didn't see it suggested elsewhere, so I wanted to leave it here as a reference.
By the way, I love that due to the nature of the algorithm this can pick up the road structure of a city and cluster it according to what lies along a street. Very cool.
Before (red is noise):
After: