rmitsch closed 6 years ago
Since we compute distances with scipy.spatial.distance.cdist, we can simply evaluate some of the metrics it already provides, e.g.:
Y = cdist(XA, XB, 'minkowski', p=2.)
Y = cdist(XA, XB, 'cityblock')
Y = cdist(XA, XB, 'seuclidean', V=None)
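To make the calls above concrete, here is a minimal sketch on toy data. The arrays XA and XB are made-up 2-D points, not the project's sensor snippets; only cdist and its 'minkowski' and 'cityblock' metrics come from SciPy.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy inputs (assumed for illustration): one observation per row.
XA = np.array([[0.0, 0.0], [1.0, 1.0]])
XB = np.array([[3.0, 4.0], [1.0, 0.0]])

# Minkowski with p=2 is the Euclidean (L2) distance.
euclid = cdist(XA, XB, "minkowski", p=2.0)

# Cityblock is the Manhattan (L1) distance.
manhattan = cdist(XA, XB, "cityblock")

# Each result has shape (len(XA), len(XB)).
print(euclid)
print(manhattan)
```

Entry [i, j] of each result is the distance between row i of XA and row j of XB.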
To try most of them out, you only have to call:
trips_cut_per_30_sec = Preprocessor.get_cut_trip_snippets_for_total(preprocessed_data, snippet_length=30, sensor_type="acceleration")
distance_metric="cityblock"
distance_matrix_n2 = Preprocessor.calculate_distance_for_n2(trips_cut_per_30_sec, metric=distance_metric)
Note that some metrics need additional parameters, for which the function interface of calculate_distance_for_n2 has to be wrapped or extended (see the Minkowski distance above, which takes p). cdist also lets you pass your own distance function; see the docs.
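A rough sketch of both points, under assumptions: calculate_distance_with_params is a hypothetical stand-in that works directly on arrays (the real signature of calculate_distance_for_n2 isn't shown here), and mean_abs_diff is a made-up custom metric. The only library behaviour relied on is that cdist forwards extra keyword arguments to the metric and accepts a callable taking two 1-D vectors.

```python
import numpy as np
from scipy.spatial.distance import cdist


def calculate_distance_with_params(XA, XB, metric="euclidean", **metric_params):
    """Hypothetical wrapper: forwards metric-specific keyword arguments to cdist."""
    return cdist(XA, XB, metric, **metric_params)


def mean_abs_diff(u, v):
    """Made-up custom metric: mean absolute difference of two 1-D vectors."""
    return np.abs(u - v).mean()


# Toy inputs (assumed for illustration).
XA = np.array([[0.0, 0.0], [2.0, 2.0]])
XB = np.array([[1.0, 3.0]])

# Parameterised built-in metric: p is passed through to cdist.
d_minkowski = calculate_distance_with_params(XA, XB, "minkowski", p=3.0)

# Custom metric: cdist calls mean_abs_diff once per pair of rows.
d_custom = calculate_distance_with_params(XA, XB, mean_abs_diff)
```

Keep in mind that a Python callable is evaluated pair by pair, so it will typically be much slower than the built-in string metrics.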
Since we haven't implemented other clustering algorithms yet, I'll cycle through all of sklearn's distance metrics + DTW and measure against the implemented k-means clustering. I think treating the transport modes as ground truth for three separable clusters lets us establish a baseline for the usefulness of those easily available metrics.
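The procedure could look roughly like the sketch below. Everything in it is an assumption for illustration: the data are three synthetic blobs standing in for transport modes, the rows of each distance matrix are fed to sklearn's KMeans as features (the project's k-means may consume the matrix differently), agreement with the labels is scored with the adjusted Rand index, and DTW is left out because SciPy's cdist has no built-in DTW metric.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the trip snippets: 3 well-separated groups of 30 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
labels = np.repeat(np.arange(3), 30)  # assumed "ground truth" modes
X = centers[labels] + rng.normal(scale=0.5, size=(90, 2))

metrics = ["euclidean", "cityblock", "chebyshev"]
scores = {}

for metric in metrics:
    # Square (n x n) distance matrix under the current metric.
    distance_matrix = cdist(X, X, metric)

    # Assumption: use each row of the distance matrix as a feature vector.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    predicted = kmeans.fit_predict(distance_matrix)

    # Adjusted Rand index: 1.0 means perfect agreement with the labels.
    scores[metric] = adjusted_rand_score(labels, predicted)

for metric, score in scores.items():
    print(f"{metric}: ARI = {score:.3f}")
```

The adjusted Rand index is used here because it ignores how cluster IDs are numbered, so k-means labels don't have to be matched to transport modes by hand.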
Of course a different set of engineered features might benefit from other distance metrics - in this case we can still repeat the procedure using a different distance matrix as starting point.
Depends on #27.
Implemented as described in comment above. Visualized with (1) parallel coordinates and (2) a grouped barchart. I'd deem (1) to be a better visualization, but plotly apparently doesn't allow for either detail info on hover over a data series or a legend for correlating data series names to colors, so it's unfortunately kinda useless.
Preliminary evaluation against k-means shows that most distance functions lead to comparable results:
Runtime was significantly higher for Mahalanobis and DTW, all other metrics were on par. Exact numbers/chart to follow.
Merged into master. Suggesting to close this issue.
Note: Data on Euclidean distances between individual sensor dimensions still missing.
I'll close. Reopen if you disagree.
I agree, but we should not forget to put this figure into the documentation.
Added it.
Which distance metrics (other than L2, DTW) might be useful/performant?