univie-datamining-team3 / assignment2

Analysis of mobility data
MIT License

Distance measures: Evaluate other distance metrics #23

Closed (rmitsch closed 6 years ago)

rmitsch commented 6 years ago

Which distance metrics (other than L2, DTW) might be useful/performant?

Lumik7 commented 6 years ago

Since we use scipy.spatial.distance.cdist to calculate the distances, we can simply evaluate some of the metrics it already provides, e.g.:

Y = cdist(XA, XB, 'minkowski', p=2.)
Y = cdist(XA, XB, 'cityblock')
Y = cdist(XA, XB, 'seuclidean', V=None)

To try most of them out, you only have to call the following first:

trips_cut_per_30_sec = Preprocessor.get_cut_trip_snippets_for_total(preprocessed_data, snippet_length=30, sensor_type="acceleration")

distance_metric="cityblock"
distance_matrix_n2 = Preprocessor.calculate_distance_for_n2(trips_cut_per_30_sec, metric=distance_metric)

Note that some metrics need additional parameters, for which the interface of calculate_distance_for_n2 has to be wrapped or extended (e.g. p for the Minkowski distance above). cdist also lets you pass your own distance function; see the docs.
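A minimal sketch of what such a wrapper could look like, assuming it only needs to forward keyword arguments to cdist. The names calculate_distance and max_abs_difference are hypothetical and not part of the repository; the random arrays stand in for the real trip snippets.

```python
import numpy as np
from scipy.spatial.distance import cdist


def calculate_distance(xa, xb, metric="euclidean", **metric_kwargs):
    """Hypothetical wrapper: forwards metric-specific parameters to cdist."""
    return cdist(xa, xb, metric=metric, **metric_kwargs)


def max_abs_difference(u, v):
    """Custom distance callable: largest absolute coordinate difference."""
    return np.max(np.abs(u - v))


# Random stand-in data (assumption): 4 and 5 snippets with 6 values each.
rng = np.random.default_rng(0)
xa = rng.normal(size=(4, 6))
xb = rng.normal(size=(5, 6))

# A metric that needs an extra parameter (p) ...
d_minkowski = calculate_distance(xa, xb, "minkowski", p=3.0)
# ... and a user-defined metric passed as a callable.
d_custom = calculate_distance(xa, xb, max_abs_difference)
```

Passing a Python callable is convenient but usually much slower than the built-in metric names, since cdist then calls the function once per pair of rows.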

rmitsch commented 6 years ago

Since we haven't implemented other clustering algorithms yet, I'll cycle through all of sklearn's distance metrics + DTW and measure against the implemented k-means clustering. I think treating the transport modes as ground truth for three separable clusters lets us establish a baseline for the usefulness of those readily available metrics.

Of course a different set of engineered features might benefit from other distance metrics - in this case we can still repeat the procedure using a different distance matrix as starting point.
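Roughly, the procedure could look like the sketch below. Everything here is an assumption for illustration: the data is synthetic rather than the actual trip snippets, the rows of the distance matrix are used as k-means features, and the adjusted Rand index is just one possible way to score against the ground-truth modes. DTW is left out because it is not among sklearn's built-in metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, pairwise_distances

# Synthetic stand-in for the trip snippets: three well-separated groups,
# playing the role of the three transport modes (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
true_modes = np.repeat([0, 1, 2], 20)

scores = {}
for metric in ["euclidean", "cityblock", "chebyshev", "cosine"]:
    # Each row of the distance matrix serves as one snippet's feature vector.
    distance_matrix = pairwise_distances(X, metric=metric)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(distance_matrix)
    # Agreement between cluster labels and ground-truth transport modes.
    scores[metric] = adjusted_rand_score(true_modes, labels)

for metric, score in scores.items():
    print(f"{metric}: {score:.3f}")
```

Swapping in a different feature set only changes how the distance matrix is built; the loop itself stays the same.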

Depends on #27.

rmitsch commented 6 years ago

Implemented as described in the comment above. Visualized with (1) parallel coordinates and (2) a grouped bar chart. I'd deem (1) to be the better visualization, but plotly apparently doesn't allow for either detail info when hovering over a data series or a legend that maps data series names to colors, so it's unfortunately kinda useless.

Preliminary evaluation against k-means shows that most distance functions lead to comparable results:

[Figure: distance_metric_evaluation_kmeans]

Runtime was significantly higher for Mahalanobis and DTW; all other metrics were on par. Exact numbers/chart to follow.
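For reference, a runtime comparison like this could be collected along the following lines. This is only a sketch on random data of assumed size, not the measurement behind the numbers above, and it omits DTW since scipy's cdist has no built-in DTW metric.

```python
import time

import numpy as np
from scipy.spatial.distance import cdist

# Random stand-in data (assumption): 300 snippets with 8 values each.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

runtimes = {}
for metric in ["euclidean", "cityblock", "mahalanobis"]:
    start = time.perf_counter()
    cdist(X, X, metric=metric)  # full pairwise distance matrix
    runtimes[metric] = time.perf_counter() - start

for metric, seconds in runtimes.items():
    print(f"{metric}: {seconds:.4f} s")
```

A single run is noisy; repeating each measurement a few times and taking the minimum or median would give more reliable numbers.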

Merged into master. Suggesting to close this issue.

Note: Data on Euclidean distances between individual sensor dimensions still missing.

rmitsch commented 6 years ago

I'll close. Reopen if you disagree.

Lumik7 commented 6 years ago

I agree, but we should not forget to put this figure into the documentation.

rmitsch commented 6 years ago

Added it.