slaypni / fastdtw

A Python implementation of FastDTW
MIT License

Add parallelization functions to the package #29

Open lvermue opened 5 years ago

lvermue commented 5 years ago

Full parallelization was added to the package using the joblib library. Now NxM matrices, i.e. N-time series with M-time points, can be calculated in parallel. To embed different lengths the missing time points can be padded with np.nan values.
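The padding scheme described above can be sketched like this (a minimal example, not code from the PR): two series of different lengths are placed into one N x M matrix, with the missing time points filled by np.nan.

```python
import numpy as np

# Two time series of different lengths
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])

# Pad both to the longer length with np.nan so they fit one N x M matrix
max_len = max(len(a), len(b))
X = np.full((2, max_len), np.nan)
X[0, :len(a)] = a
X[1, :len(b)] = b
```
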

The changes were tested on a machine with 20 cores, leading to the following results.

Single core

import numpy as np
from scipy.spatial.distance import euclidean
from time import time
from fastdtw import fastdtw

X = np.random.randint(1, 40, size=(100, 100))
dist_mat = np.zeros((X.shape[0], X.shape[0]))
indices = np.vstack(np.triu_indices(X.shape[0], 1)).T

start = time()

for row, column in indices:
    distance, path = fastdtw(X[row], X[column], dist=euclidean)
    dist_mat[row, column] = distance

print('It took {:.0f} seconds'.format(time()-start))

# It took 175 seconds

Parallel

import numpy as np
from scipy.spatial.distance import euclidean
from time import time
from fastdtw import fastdtw_parallel, get_path

# Reusing the same X matrix as in the single-core example

start = time()

# Same machine with 20 cores
distance_matrix, path_list = fastdtw_parallel(X, dist=euclidean, n_jobs=-1)

print('It took {:.0f} seconds'.format(time()-start))

# It took 11 seconds

Examples on how to use the new functions were added to the README.rst file and the docstring of the respective functions.

slaypni commented 5 years ago

@lvermue Thank you for the PR! The execution time improvement looks significant.

~~Could you write tests for the new functions?~~ (Struck through because of the following question)

slaypni commented 5 years ago

@lvermue Is just writing something like this code insufficient?

import itertools

from fastdtw import fastdtw
from joblib import Parallel, delayed
import numpy as np

X = np.random.randint(1, 40, size=(100, 100))
results = Parallel(n_jobs=-1)(
    delayed(fastdtw)(X[i], X[j])
    for i, j in itertools.product(range(100), repeat=2)
)
distance_mat = np.array([r[0] for r in results]).reshape(100, 100)
lvermue commented 5 years ago

@slaypni There are two main aspects to this:

  1. The way it is written now, it includes some optimization considerations.
  2. It would not be as user-friendly, especially for less Python-versed users of this package.
slaypni commented 5 years ago

@lvermue As you mentioned, the simple script could cut its execution time in half by replacing itertools.product with itertools.combinations to skip the unnecessary pairs. Even in that case, the simple version takes 39ms, which is still 60% longer than the proposed version.

So I think the proposed version is good for computing a distance matrix, but I would also prefer some changes to its code structure.

Glancing at the diff, I noticed the same pattern of code repeated in several places, which seems redundant. It would be nicer to factor that code out.

Also, computing a distance matrix is a bit outside the scope of this package; still, it would be nice to have a convenient function for it. So I would like to have that function under fastdtw.util, where such utilities can live, rather than directly under fastdtw.

Taking those into account, I would prefer something like the following distmat to be implemented instead of dtw_parallel and fastdtw_parallel.

from functools import partial
from fastdtw import fastdtw
from fastdtw.util import distmat

dists, paths = distmat(partial(fastdtw, radius=3), X)
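A minimal sketch of what such a distmat utility might look like, under the interface suggested above. This is a hypothetical implementation, not the actual fastdtw.util API; the n_jobs parameter and the (dists, paths) return shape are assumptions.

```python
import itertools

import numpy as np
from joblib import Parallel, delayed


def distmat(dist_fn, X, n_jobs=-1):
    """Hypothetical sketch of the proposed utility.

    dist_fn is expected to return a (distance, path) tuple, as fastdtw does.
    Returns a symmetric distance matrix and a dict of paths keyed by (i, j).
    """
    n = len(X)
    # Each unordered pair is computed once and mirrored into both halves
    pairs = list(itertools.combinations(range(n), 2))
    results = Parallel(n_jobs=n_jobs)(
        delayed(dist_fn)(X[i], X[j]) for i, j in pairs
    )

    dists = np.zeros((n, n))
    paths = {}
    for (i, j), (d, p) in zip(pairs, results):
        dists[i, j] = dists[j, i] = d
        paths[(i, j)] = p
    return dists, paths
```

With this in place, the call shown above, distmat(partial(fastdtw, radius=3), X), would return the full distance matrix plus the warping paths for each pair.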