Decide on clustering algorithm

univie-datamining-team3 / assignment2

Analysis of mobility data

MIT License


Decide on clustering algorithm #25

Closed rmitsch closed 6 years ago

rmitsch commented 6 years ago

...or evaluate several. Depends on #27.

rmitsch commented 6 years ago

tslearn offers a few clustering algorithms focusing on time series data. Might be worth a look.

rmitsch commented 6 years ago

I'd suggest soft-DTW k-means. It's included in tslearn's clustering module (see here for an example illustrating clustering with tslearn; the interface is pretty much identical to sklearn's idiom).

rmitsch commented 6 years ago

After reading up a bit on common approaches to clustering time series data, there seem to be two main directions:

1. Calculate pairwise distances between time series using DTW (or soft-DTW or similar) and cluster on those distances, e.g. with hierarchical clustering methods.
2. Engineer features (e.g. with tsfresh, see #19) and calculate distances between the feature vectors with more common distance metrics (e.g. L2 norm) before clustering.
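To make the two directions above concrete, here is a minimal pure-Python sketch that builds both kinds of distance matrix for a few toy series. The toy data and the `simple_features` helper are made up for illustration. In practice the features would come from something like tsfresh, and the DTW distances from a library implementation rather than this textbook version.

```python
import math


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance for 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def simple_features(series):
    """Hypothetical stand-in for engineered features: mean, standard deviation, range."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [mean, std, max(series) - min(series)]


def l2_distance(u, v):
    """Euclidean (L2) distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Toy data (made up): the first two series share a shape but are shifted in time.
series = [
    [0, 0, 1, 2, 1, 0, 0],
    [0, 1, 2, 1, 0, 0, 0],
    [5, 5, 5, 5, 5, 5, 5],
]

# Direction 1: pairwise DTW distances between the raw series.
dtw_matrix = [[dtw_distance(s, t) for t in series] for s in series]

# Direction 2: pairwise L2 distances between engineered feature vectors.
features = [simple_features(s) for s in series]
feature_matrix = [[l2_distance(f, g) for g in features] for f in features]
```

Either matrix could then be handed to a clustering method that accepts precomputed distances, such as hierarchical clustering.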

For reference: See e. g. here, here, and here.

I'd suggest we try both approaches and compare the results against the baseline clusters defined in #31 to see which one yields better results.
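One possible way to do that comparison is a pair-counting agreement score such as the Rand index. The thread does not name a metric, so this choice is an assumption, and the labels below are made up for illustration. sklearn's `adjusted_rand_score` would be a chance-corrected alternative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree (same vs. different cluster)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Made-up labels, standing in for the baseline from #31 and the two approaches.
baseline = [0, 0, 1, 1, 2, 2]
dtw_labels = [1, 1, 0, 0, 2, 2]      # same partition, different label names
feature_labels = [0, 0, 0, 1, 2, 2]  # one point assigned differently

scores = {
    "dtw": rand_index(baseline, dtw_labels),
    "features": rand_index(baseline, feature_labels),
}
best = max(scores, key=scores.get)
```

The score only looks at whether pairs of points end up together, so it is unaffected by how the cluster labels happen to be numbered.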

Lumik7 commented 6 years ago

HDBSCAN and PreDeCon are now working in the pipe. The interface works like this:

from models.cluster import ElkiPipe
import pandas as pd

# data_dir and FLAGS.vis_path are assumed to be defined elsewhere in the project
data = pd.read_csv(data_dir, sep=";")
elki = ElkiPipe()

# for PreDeCon
params = elki.get_parameters_for_predecon(param_eps=10.0, param_minpts=2,
                                          param_delta=0.1, param_lambda=1,
                                          param_kappa=20.0)

# for HDBSCAN (use this instead of the PreDeCon parameters above)
params = elki.get_parameters_for_hdbscan(param_minpts=100)

# run ELKI with whichever parameter set was assigned last
results = elki.run_elki(data, params, plot_path=FLAGS.vis_path)

merged