sktime interface - Githubissues

mloning commented 3 years ago

Hi everyone,

Your package looks really good, and we're thinking about interfacing with it from sktime so that we can use the time series distance functions you provide, and potentially also the clustering algorithms (see https://github.com/alan-turing-institute/sktime/issues/501).

Is there anything we should keep in mind when adding dtaidistance as a dependency?
Have you played around with integrating it with sklearn (I see that you largely follow their interface)?
In clustering, what input type do you expect for the time series data in fit and predict? A 2d numpy array, assuming multiple instances of equal-length univariate data?

cc @chrisholder

wannesm commented 3 years ago

Hi @mloning,

Sounds interesting and exciting. We have started using sktime for some research tasks and teaching assignments and are satisfied users, so we are willing to provide updates and features to make this process easier. Our current focus is on our own projects (i.e. fast C versions), and while we try to keep the code reusable, some functionality might be missing because we have not needed it yet (e.g. we don't have a predict method for clustering, although that is easy to implement).

W.r.t. your questions:

dtaidistance as a dependency: I don't think there is anything special to keep in mind. We have two important optional dependencies, Cython and NumPy, which you already require (optionally). For clustering, some methods rely on SciPy and PyClustering (both optional and checked dynamically). We also chose a permissive license to make collaboration easy.
We often have sklearn or one of our own ML algorithms in the same pipeline, but the components typically communicate only through simple data structures. For example, we use DTW to compute distances to prototypes and use the results as features. Or we feed the results to scipy, pyclustering or our own clustering algorithms (e.g. for fleet-based anomaly detection). Currently we use agglomerative clustering (our own implementation or scipy), medoid clustering (using pyclustering) and, since this month, DBA-k-means (own implementation). We do indeed follow the sklearn interface, since everybody is already familiar with this API (we were also inspired by the sklearn HMM module, which has since been moved out of sklearn). Did you have anything in mind other than clustering for classification?
The expected input is quite flexible, because our use cases also produce series of different lengths and we support embedded devices. We try to be smart about it and detect the input format (numpy, array.array, list, ...). If it is an equal-length numpy matrix, we use the internal data structure directly. If it is a list of series, we keep track of individual pointers to the series. This way we avoid copy operations where we can. Series can be univariate or multivariate; both are supported for our use cases (e.g. sports monitoring).
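To make the "distances to prototypes as features" idea concrete for series of different lengths, here is a minimal pure-Python sketch. It is not dtaidistance code: `dtw_distance` and `prototype_features` are hypothetical helpers written for illustration, and the squared-difference local cost with a final square root is an assumption. In a real pipeline you would call the library's own (much faster) DTW routines instead.

```python
import math
from array import array


def dtw_distance(s, t):
    """Textbook O(len(s) * len(t)) DTW, no window or pruning (illustrative only)."""
    n, m = len(s), len(t)
    # prev is the previous row of the cumulative cost matrix; column 0 is a
    # border, and only the (0, 0) corner is a valid starting point.
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2  # assumed local cost
            # Extend the cheapest of the three allowed predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def prototype_features(series, prototypes):
    """One feature row per series: its DTW distance to every prototype."""
    return [[dtw_distance(s, p) for p in prototypes] for s in series]


# Series may differ in length and in container type (list, array.array, ...).
series = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    array("d", [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]),
    [2.0, 2.0, 2.0],
]
prototypes = [[0.0, 1.0, 2.0, 1.0, 0.0], [2.0, 2.0]]

features = prototype_features(series, prototypes)
for row in features:
    print(row)
```

The resulting `features` is a plain list of equal-length rows, which is the kind of simple data structure that can be handed to sklearn or any other downstream learner, even though the input series themselves are not equal-length.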

mloning commented 3 years ago

Thanks @wannesm, sounds good! @chrisholder will start working on it over the next few weeks, and we'll report back if we have more questions!

wannesm / dtaidistance

sktime interface #98