tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

Feature on TimeSeriesKmean: DTW_BaryCenterAverage #268

Open NimaSarajpoor opened 4 years ago

NimaSarajpoor commented 4 years ago

Hi,

I opened an issue about this before but closed it, and decided to raise it here as a feature request instead.

I was wondering if you could modify TimeSeriesKMeans so that it accepts a weight (as a callable) in its metric_params, to be used when computing the barycenter with dtw_barycenter_averaging.

So: metric_params = {'weights': my_function}

Here, my_function takes a set of data points (observations), computes a weight vector from them, and returns it. This gives the user the flexibility to define a weight function and apply it throughout the clustering process.

(In my problem, for instance, I modified the centroids of the FINAL RESULT and saw that it works better for me. However, if such a modification could be applied throughout the whole clustering process (and not just to the final result), it might further improve the final clusters.)

Best, Nima

GillesVandewiele commented 4 years ago

Not sure if I understand the question completely, but there is a sample_weight argument in the fit function of TimeSeriesKMeans. Can't you precompute all weights with my_function before calling that and pass them to fit?

rtavenar commented 4 years ago

@GillesVandewiele I think we do have such a parameter for KernelKMeans but not for TimeSeriesKMeans. And I agree that this would be the correct way to implement it.

This should not be too difficult to implement since dtw_barycenter_averaging already accepts weights as input, so I guess it would be a matter of:

  1. add the new sample_weights argument to fit
  2. see in KernelKMeans how this argument is pre-processed and do the same
  3. call the barycenters with adequate weights
  4. use weights for inertia computation

Hence I tag this one as a good first issue: anyone willing to work on this should feel free to open a PR.
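To make steps 3 and 4 concrete, here is a minimal sketch of what fixed per-sample weights would change. It is only an illustration under stated assumptions: a weighted element-wise mean and squared Euclidean distance stand in for dtw_barycenter_averaging and DTW, and the helper names (weighted_barycenter, weighted_inertia) are hypothetical, not tslearn functions.

```python
# Sketch only: Euclidean stand-ins for DBA/DTW; all helper names are hypothetical.

def weighted_barycenter(series, weights):
    """Weighted element-wise mean of equal-length series."""
    total = sum(weights)
    length = len(series[0])
    return [
        sum(w * s[t] for s, w in zip(series, weights)) / total
        for t in range(length)
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_inertia(series, labels, centers, sample_weight):
    """Sum of sample-weighted squared distances to the assigned centers."""
    return sum(
        w * squared_distance(s, centers[k])
        for s, k, w in zip(series, labels, sample_weight)
    )


X = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]]
sample_weight = [3.0, 1.0, 1.0]  # fixed for the whole fit
labels = [0, 0, 1]               # current cluster assignment

# Step 3: each barycenter only sees the weights of its own members.
centers = []
for k in range(2):
    members = [i for i, lab in enumerate(labels) if lab == k]
    centers.append(
        weighted_barycenter(
            [X[i] for i in members], [sample_weight[i] for i in members]
        )
    )

# Step 4: the same weights enter the inertia.
inertia = weighted_inertia(X, labels, centers, sample_weight)
```

In the real implementation, the barycenter step would presumably pass the per-cluster slice of the weights to dtw_barycenter_averaging through its weights argument instead of using the Euclidean mean shown here.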

NimaSarajpoor commented 4 years ago

@GillesVandewiele @rtavenar

Thanks for the response. I took a look at the sample_weights argument of the KernelKMeans fit method. As far as I understand, it only accepts a predefined weight vector.

However, in my case, the weights change during clustering. In other words, the weights of the points in a cluster (used to compute its DBA) are computed as a function of those points, giving an array whose length equals the cluster size.

So, it would be nice if it can accept a function as well.
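A rough sketch of the idea, to show how it differs from a fixed weight vector. Everything here is hypothetical: inverse_distance_weights is just one possible weight function, update_center is an illustrative name, and a weighted Euclidean mean stands in for DBA.

```python
# Hypothetical sketch: neither function exists in tslearn; a weighted
# Euclidean mean stands in for DBA.

def inverse_distance_weights(members):
    """Return one weight per member, larger for series near the plain mean."""
    n = len(members)
    length = len(members[0])
    mean = [sum(s[t] for s in members) / n for t in range(length)]
    dists = [sum((x - m) ** 2 for x, m in zip(s, mean)) ** 0.5 for s in members]
    return [1.0 / (1.0 + d) for d in dists]


def update_center(members, weight_fn):
    """Recompute one cluster center using weights derived from its members."""
    weights = weight_fn(members)
    assert len(weights) == len(members)  # one weight per series in the cluster
    total = sum(weights)
    return [
        sum(w * s[t] for s, w in zip(members, weights)) / total
        for t in range(len(members[0]))
    ]


cluster = [[0.0, 0.0], [1.0, 1.0], [8.0, 8.0]]
center = update_center(cluster, inverse_distance_weights)
```

Because the weights are recomputed from the current members, they change whenever the cluster assignment changes, so they cannot be passed once as a fixed vector before fitting.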

rtavenar commented 4 years ago

@Ninimama

I understand your point, yet:

  1. depending on the form of your weight computation function, I am not sure that the algorithm at stake in TimeSeriesKMeans would be guaranteed to converge
  2. We will definitely stick to scikit-learn API in this case, and in scikit-learn, sample_weights is assumed to be a vector of fixed weights.

NimaSarajpoor commented 4 years ago

@rtavenar

  1. I thought about the convergence problem. However, I think that is something the user should worry about. If someone wants to use a weight function, they should show, either mathematically or experimentally, that the results are good and that the algorithm converges. So, wouldn't it be a good idea to offer this ability so that one can experiment with weights? The tslearn package could warn the user that the algorithm might not converge, or warn when the maximum number of iterations is exceeded. Any opinion?

  2. Yes. I agree that using fixed sample_weights is a stable approach: there is no need to worry about non-convergence, and the result is reliable.
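The kind of warning I have in mind could look roughly like the sketch below. It is not tslearn code: run_until_converged and its step callback are hypothetical names, and the check simply watches whether the inertia goes up or the iteration budget runs out.

```python
import warnings


def run_until_converged(step, initial_inertia, max_iter=50, tol=1e-6):
    """Call `step()` (which returns the new inertia) until it stabilises.

    Warns if the inertia increases or the iteration budget is exhausted.
    """
    previous = initial_inertia
    for iteration in range(1, max_iter + 1):
        current = step()
        if current > previous + tol:
            warnings.warn(
                f"inertia increased at iteration {iteration}; "
                "convergence is not guaranteed with data-dependent weights"
            )
        elif previous - current <= tol:
            return iteration, current
        previous = current
    warnings.warn(f"no convergence after {max_iter} iterations")
    return max_iter, previous


# Toy inertia sequence that oscillates instead of decreasing.
values = iter([5.0, 6.0, 5.0, 6.0])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    n_iter, last = run_until_converged(lambda: next(values), 7.0, max_iter=4)
```

In this toy run the inertia rises twice and the loop never settles, so three warnings are recorded in total.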

In the end, you are the expert here. So, you definitely know better than me. My field is in electrical engineering (power system) and I am a newbie in this area.

Thanks again for your responses.

GillesVandewiele commented 4 years ago

> @Ninimama
>
> I understand your point, yet:
>
> 1. depending on the form of your weight computation function, I am not sure that the algorithm at stake in `TimeSeriesKMeans` would be guaranteed to converge
>
> 2. We will definitely stick to `scikit-learn` API in this case, and in `scikit-learn`, `sample_weights` is assumed to be a vector of fixed weights.

I agree! Although it should be noted that there are some exceptions to this, e.g. KNN can accept a string for its weights parameter (uniform or distance-based), and it can be a callable as well. A sample_weight, on the other hand, is indeed a vector of weights passed to the fit method.

rtavenar commented 4 years ago

But for knn, weights are just used at predict time, they are not involved in any fit time optimization.
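To illustrate the difference, here is a small standalone sketch (plain Python, not scikit-learn code; knn_predict, uniform and inverse_distance are hypothetical names). The weight callable maps neighbor distances to vote weights and is only evaluated when predicting, so nothing that is optimized at fit time depends on it.

```python
from collections import defaultdict


def knn_predict(train, query, k, weights):
    """train: list of (value, label); weights: callable(distances) -> weights."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    distances = [abs(value - query) for value, _ in nearest]
    votes = defaultdict(float)
    for (_, label), w in zip(nearest, weights(distances)):
        votes[label] += w
    return max(votes, key=votes.get)


def uniform(distances):
    """Every neighbor gets the same vote."""
    return [1.0 for _ in distances]


def inverse_distance(distances):
    """Closer neighbors get larger votes."""
    return [1.0 / (d + 1e-9) for d in distances]


train = [(0.0, "a"), (4.0, "b"), (5.0, "b")]
by_uniform = knn_predict(train, 1.0, k=3, weights=uniform)
by_distance = knn_predict(train, 1.0, k=3, weights=inverse_distance)
```

The stored training data is identical in both calls; only the vote at prediction time changes.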

Once again I feel that this could definitely break convergence which is not a desirable behavior.