jschulberg / DC-Transportation-Crashes

Analysis of transportation-related crashes (car, motorcycle, pedestrian, bike) in the Washington, D.C. area.

Geospatial Clustering: DBSCAN #5

Closed jschulberg closed 1 year ago

jschulberg commented 1 year ago

Use DBSCAN, a density-based clustering approach. With DBSCAN, we pre-set the maximum distance (epsilon) that two points can be apart and still be clustered together. The benefit is that we do not need to pre-set the number of clusters, which is difficult to determine in advance; instead, the clusters follow the density of points on the map. To implement DBSCAN correctly, we plan to follow the approach used by Geoff Boeing in his paper Clustering to Reduce Spatial Data Set Size, using the built-in ‘haversine’ metric, which takes into account the curvature of the Earth so we can properly use latitude and longitude points.
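A minimal sketch of what this could look like, assuming scikit-learn's `DBSCAN` is the implementation in question (the issue does not name the library). The helper `cluster_crashes`, the 0.5 km epsilon, `min_samples=5`, and the synthetic points are all hypothetical and only for illustration. The haversine metric expects (latitude, longitude) in radians, so epsilon is converted from kilometers to radians as well.

```python
import numpy as np
from sklearn.cluster import DBSCAN

KM_PER_RADIAN = 6371.0088  # mean Earth radius in kilometers


def cluster_crashes(coords_deg, eps_km=0.5, min_samples=5):
    """Cluster (lat, lon) pairs given in degrees; label -1 marks noise.

    eps_km and min_samples are illustrative defaults, not tuned values.
    """
    # The haversine metric works on radians, with columns ordered (lat, lon).
    coords_rad = np.radians(np.asarray(coords_deg, dtype=float))
    model = DBSCAN(
        eps=eps_km / KM_PER_RADIAN,  # convert kilometers to radians
        min_samples=min_samples,
        algorithm="ball_tree",
        metric="haversine",
    )
    return model.fit_predict(coords_rad)


# Synthetic stand-in for crash locations: two tight groups of points in D.C.
rng = np.random.default_rng(0)
centers = np.array([[38.9072, -77.0369], [38.9500, -77.0000]])
coords = np.vstack([c + rng.normal(scale=0.0005, size=(30, 2)) for c in centers])

labels = cluster_crashes(coords, eps_km=0.5, min_samples=5)
n_clusters = len(set(labels) - {-1})
print(f"clusters found: {n_clusters}")
```

With real data, `coords` would come from the latitude and longitude columns of the crash data set rather than from generated points.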

To evaluate the results of DBSCAN, we plan to measure the silhouette score for various values of the main parameter, epsilon. For each point, the silhouette score compares the average distance to the other points in its own cluster (a) with the average distance to the points in the nearest neighboring cluster (b). Thus, the formula is:

$$ s = \frac{b - a}{\max(a, b)} $$

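The score ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters. Averaged over all points, this will give us a good sense of how well-separated the clusters are for various values of epsilon.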
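The epsilon sweep could be sketched as follows, again assuming scikit-learn. The helper `silhouette_by_epsilon`, the candidate epsilon values, and the synthetic points are hypothetical. Two choices here are assumptions rather than anything stated in this issue: noise points (label -1) are dropped before scoring, and an epsilon that yields fewer than two clusters gets `None` because the silhouette score is undefined in that case.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import haversine_distances

KM_PER_RADIAN = 6371.0088  # mean Earth radius in kilometers


def silhouette_by_epsilon(coords_deg, eps_km_values, min_samples=5):
    """Return {eps_km: mean silhouette score, or None if it is undefined}."""
    coords_rad = np.radians(np.asarray(coords_deg, dtype=float))
    scores = {}
    for eps_km in eps_km_values:
        labels = DBSCAN(
            eps=eps_km / KM_PER_RADIAN,
            min_samples=min_samples,
            algorithm="ball_tree",
            metric="haversine",
        ).fit_predict(coords_rad)
        keep = labels != -1  # assumption: exclude noise points from scoring
        n_clusters = len(set(labels[keep]))
        # The score needs at least 2 clusters and fewer clusters than points.
        if n_clusters < 2 or n_clusters >= keep.sum():
            scores[eps_km] = None
            continue
        # Great-circle distances (in radians) between the clustered points.
        dist = haversine_distances(coords_rad[keep])
        scores[eps_km] = silhouette_score(dist, labels[keep], metric="precomputed")
    return scores


# Synthetic stand-in for crash locations: two tight groups of points in D.C.
rng = np.random.default_rng(0)
centers = np.array([[38.9072, -77.0369], [38.9500, -77.0000]])
coords = np.vstack([c + rng.normal(scale=0.0005, size=(30, 2)) for c in centers])

scores = silhouette_by_epsilon(coords, eps_km_values=[0.1, 0.5, 10.0])
for eps_km, score in scores.items():
    print(eps_km, score)
```

Because noise is excluded, a very small epsilon can score well while discarding most points, so the share of points labeled as noise is worth tracking alongside the score.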

jschulberg commented 1 year ago

Created a few maps and put them in the /Images folder. This is good to go.