Does this support Numba GPU utilization?

No it doesn't.

The reason for this is DTW is almost equivalent to a dynamic programming problem in the simplest brute force setting, and CUDA kernels require independence in their operations.

There are papers that parallelize DTW using prefix notations to compute approximate forms of parallel Cuda optimized DTW, which is exceptionally fast, but my objective was to have an exact DTW implementation that is almost at C levels of speed post-compilation and can be parallelized in multiprocess environment to speed up batch processing.

Technically, with it's current optimizations, we were able to run exact DTW on Starlight Curves dataset (1000 train samples, 8236 test samples, 1024 timesteps univariate dataset from UCR) in around an hour. This entails computing a 1000x8236 distance matrix with 1024x1024 comparisons for each of those 8 million comparisons using the DTW algorithm, which is pretty fast on a 16 core machine.

titu1994 / dtw-numba

Does this support Numba GPU utilization? #2