slaypni / fastdtw

A Python implementation of FastDTW
MIT License
777 stars 121 forks

add cython #4

Closed esquires closed 7 years ago

esquires commented 8 years ago

This also incorporates a new way of calculating the dtw function. The new calculation alone yielded a 2x runtime improvement, with Cython providing another 2x for a total of 4x. The test script was:

import fastdtw
import numpy as np

thetas = np.linspace(0, 2 * np.pi, 1000)
x = np.sin(thetas)
y = np.sin(thetas * 2)
print(fastdtw.fastdtw_py(x, y, radius=1)[0])
print(fastdtw.fastdtw_py2(x, y, radius=1)[0])
print(fastdtw.fastdtw(x, y, radius=1)[0])

with the following conventions:

* fastdtw_py - code in fastdtw/fastdtw_py.py
* fastdtw_py2 - pure Python method but with the new dtw function (this
  code is not committed but can be regenerated by moving fastdtw.pyx to
  a pure Python module)
* fastdtw - cythonized code

The new calculation method is faster because it

* removes an unnecessary generator that adds +1 to the window in the
  dtw method.
* minimizes accessing the defaultdict.
* caches the list in expand window rather than recreating it every
  loop.
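The defaultdict-access point can be sketched as follows. This is an illustrative reconstruction, not the committed code; the `(cost, i, j)` tuple layout is assumed for illustration:

```python
from collections import defaultdict

INF = float('inf')

# Illustrative sketch of "minimizes accessing the defaultdict": each
# neighbour cell is fetched once and reused, instead of re-indexing D
# for every comparison in the min() expression.
D = defaultdict(lambda: (INF, 0, 0))
D[(0, 0)] = (0.0, 0, 0)

i, j, d = 1, 1, 0.25
up, left, diag = D[(i - 1, j)], D[(i, j - 1)], D[(i - 1, j - 1)]
best = min((up[0], i - 1, j), (left[0], i, j - 1), (diag[0], i - 1, j - 1))
D[(i, j)] = (best[0] + d, best[1], best[2])

print(D[(1, 1)])  # (0.25, 0, 0)
```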
kernc commented 8 years ago

A 4x speed-up is negligible. With properly Cythonized code, you should be looking at a speed-up of one to three orders of magnitude. Can you use cython --annotate to identify any further parts that could benefit from optimization?

slaypni commented 8 years ago

@esquires Thank you for writing the code! Although the performance improvement is beneficial, I feel it would be better to tweak some parts of the changes before merging them into the master branch. Things I noticed:

I feel it would be better not to make this module depend on Cython. The performance improvement could be provided in an "opt-in" fashion: Cython would not be mandatory for installing fastdtw, but it would accelerate the module if Cython is present in the environment. In that case, the original .py file would be kept mostly as it is, except for trying to import the Cythonized functions written by @esquires when they are available.
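The opt-in pattern described here is the usual try/except import fallback. A minimal stand-in sketch (`_accelerated` is a hypothetical module name, not the package's final layout):

```python
# Sketch of the opt-in acceleration pattern: prefer a compiled extension
# when it exists, fall back to pure Python otherwise. "_accelerated" is
# a hypothetical stand-in for the Cython build product.
try:
    from _accelerated import dtw_cost  # compiled extension, if built
except ImportError:
    def dtw_cost(a, b):  # pure-Python fallback, always importable
        return abs(a - b)

print(dtw_cost(3, 5))  # → 2 (fallback path when no extension is built)
```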

I would like to merge esquires:master into the dev branch, modify it there, and then merge it into the master branch. What do you think about this approach, @esquires?

slaypni commented 8 years ago

@kernc The performance improvement of fastdtw.pyx over the original code was 2.5x in my environment. Although I have not confirmed it, the performance bottleneck may be the heavily accessed defaultdict, which Cython cannot accelerate. The defaultdict is used in order to guarantee O(N) memory complexity. In fastdtw 0.1.*, a numpy array was used instead of the defaultdict; although it did not provide O(N) memory complexity, it could be considerably faster.
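The trade-off can be sketched as follows (illustrative, not the package's actual code): a defaultdict stores only the window cells actually written, while a dense array allocates len_x × len_y entries up front:

```python
from collections import defaultdict

INF = float('inf')

# Sparse cost storage: only cells inside the warping window are ever
# written, so memory is proportional to the number of window cells
# (O(N) for fastdtw's radius-bounded window) rather than the O(N^2)
# of a dense len_x * len_y matrix. The (cost, i, j) tuple layout is
# assumed for illustration.
D = defaultdict(lambda: (INF, 0, 0))
D[(0, 0)] = (0.0, 0, 0)
D[(3, 4)] = (1.5, 3, 3)

print(len(D))  # 2 entries; dense storage for 1000x1000 would hold 10**6
# Caveat: defaultdict inserts the default on a missing-key *read*, so a
# hot loop should avoid needless lookups.
```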

kernc commented 8 years ago

Indeed, if all yellow lines (fastdtw.html.zip) were made white, the execution could be much faster. Cython can incorporate C++ types, and if targeting C++11+ is an option, unordered_map can be used as an almost drop-in dict replacement.

kernc commented 8 years ago

Then there are a couple of Cython compiler directives one should make sure to apply. In particular, boundscheck=False, wraparound=False, and cdivision=True should offer a significant speed-up.
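For reference, these directives can be set once at the top of the .pyx file. A minimal sketch (the function is illustrative, not part of fastdtw):

```cython
# cython: boundscheck=False, wraparound=False, cdivision=True
# Illustrative module header: boundscheck/wraparound remove per-index
# checks on typed arrays, cdivision uses C division semantics. The same
# directives can also be applied per-function via @cython decorators or
# globally through cythonize(compiler_directives=...).

def mean(double[:] a):
    cdef Py_ssize_t i
    cdef double s = 0
    for i in range(a.shape[0]):
        s += a[i]            # compiles to a raw C array access
    return s / a.shape[0]
```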

esquires commented 8 years ago

Thanks for the comments. If C++11 is allowed, then that solution will work great. It will need to use std::pair as the key.
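A sketch of keying an STL map by (i, j) from Cython (illustrative, not the PR's actual code). Note that std::map accepts std::pair keys out of the box because pair defines operator<, whereas std::unordered_map would additionally need a custom hash for pair, so the ordered map is the simpler near-drop-in replacement for the Python dict:

```cython
# distutils: language = c++
from libcpp.map cimport map as cpp_map
from libcpp.pair cimport pair

ctypedef pair[int, int] coord

# Module-level C++ container; lives for the lifetime of the module.
cdef cpp_map[coord, double] costs

costs[coord(0, 0)] = 0.0   # operator[] inserts on first access, like a dict
costs[coord(1, 1)] = 0.5
```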

If not, one can build the window up row by row, keeping track of the indexes. Those indexes can then be used to index the "D" object (now a vector instead of a defaultdict). It runs about 75x faster after cythonizing on my machine, but the code is a bit more complicated.

kernc commented 8 years ago

C++ pairs do seem to work.

esquires commented 8 years ago

All right. I will push something in the next week.

esquires commented 8 years ago

The latest uses the STL (maps/sets), so it does not significantly alter the pure Python version. The speedup is ~5-6x from what I am seeing. There are other opportunities for speed (noted in the file), but when I tried them they didn't seem to give much more performance. Perhaps they would get up to 10x in aggregate. I also didn't notice any speedup from the compiler directives. Let me know if I am missing something.

If code complexity is not a big issue I have pushed a branch in the fork called "indexing-approach". It basically uses arrays only instead of maps/sets and reaches 75x without compiler directives.

kernc commented 8 years ago

Let me know if I am missing something.

Indeed, I didn't stress this enough. The primary speed increase comes from reducing calls into the Python library and staying in pure C as much as possible. This means switching data structures to C++ structures as needed, but the mere switch is secondary to restructuring the code. What I mean by restructuring is roughly this. Compare:

[screenshot: the "bad" version]

The above compiles to C with lots of Py* calls, tuple construction, packing, and reference counting. All in all, not something you ever want in a tight loop. Sure, it could compile into a simple ternary (?:) operation, but apparently it does not. Better:

[screenshot: the "good" version]

A couple more Cython lines, but far simpler and much, much faster C++ code.
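The original screenshots are not preserved here; the following is a hypothetical reconstruction of the general pattern being described, not kernc's actual snippet. A tuple-based min builds and compares tuples through Python machinery on every iteration, while explicit scalar branches compile to plain C comparisons:

```python
# Hypothetical illustration (not the actual screenshots): choosing the
# cheapest of three DTW predecessors.

# "Bad" shape: three tuples are constructed and compared via Python's
# rich-comparison machinery on every call -- costly in a tight loop.
def step_tuple(d, up, left, diag, i, j):
    return min((up + d, i - 1, j),
               (left + d, i, j - 1),
               (diag + d, i - 1, j - 1))

# "Good" shape: a few more lines, but only scalar comparisons, which
# Cython can translate to plain C (no tuple packing, no refcounting).
def step_branch(d, up, left, diag, i, j):
    if up <= left and up <= diag:
        return (up + d, i - 1, j)
    if left <= diag:
        return (left + d, i, j - 1)
    return (diag + d, i - 1, j - 1)

# Both agree when the three candidate costs are distinct:
print(step_tuple(1.0, 3.0, 2.0, 4.0, 5, 6))   # (3.0, 5, 5)
print(step_branch(1.0, 3.0, 2.0, 4.0, 5, 6))  # (3.0, 5, 5)
```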

A primary goal of a solid Cython implementation is as few yellow lines as possible. Instead of profiling, run cython with the --annotate switch and inspect the resulting HTML file.

As far as I know, list comprehensions don't optimize into C well, so perhaps consider replacing them. Applying only a couple of such optimizations (still mostly yellow), I already achieved another 3x speed-up (on top of your 5x).

kernc commented 8 years ago

If code complexity is not a big issue I have pushed a branch in the fork called "indexing-approach". It basically uses arrays only instead of maps/sets and reaches 75x without compiler directives.

Is space complexity still O(N)?

esquires commented 8 years ago

Is space complexity still O(N)?

Yes. Sorry for the ambiguity. The code logic is not as simple.

esquires commented 8 years ago

Thanks for the feedback. I was a bit more aggressive with optimization this time around and got to around 7x in my environment. I think there wasn't a bigger performance boost because the dtw loop is still calling a Python dist function on two Python lists (apparent from the Cython annotation). If one can assume a numpy array for the data, this could probably be improved substantially.

Two other smaller areas that could be optimized are the window and the path. The former can be a vector whose size has an upper bound given by window_. During the loop where it is currently built as a list, a counter could track the actual size, followed by resizing to that size. The path can also be a vector, but it will need a wrapper around the dtw function since dtw is currently callable from an outside module.
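The vector-with-counter idea can be sketched like this (names and shape are illustrative, not the branch's actual code):

```cython
# distutils: language = c++
from libcpp.vector cimport vector
from libcpp.pair cimport pair

cdef vector[pair[int, int]] build_window(int upper_bound):
    cdef vector[pair[int, int]] window
    cdef int count = 0
    window.resize(upper_bound)   # allocate the bound once, no per-append cost
    # ... the loop writes window[count] = pair(i, j) and increments count ...
    window.resize(count)         # shrink to the cells actually used
    return window
```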

Are any of those of interest?

kernc commented 8 years ago

I don't know the theory, but if the alternative indexing approach still works (correctly), is much faster, and uses the same amount of space, I sure have nothing against it.

I see on the other branch:

    cdef WindowElement *window = \
        <WindowElement *>PyMem_Malloc(len_x * len_y * sizeof(WindowElement))

    for x_idx in range(len_x):
        for y_idx in range(len_y):
            ...

Sure this is O(N)?

esquires commented 8 years ago

Sure this is O(N)?

    cdef WindowElement *window = \
        <WindowElement *>PyMem_Malloc(len_x * len_y * sizeof(WindowElement))
    for x_idx in range(len_x):
        for y_idx in range(len_y):
            ...

Yes. In the master branch, see this line in dtw for the same implementation:

    if window is None:
    if window is None:
        window = [(i, j) for i in xrange(len_x) for j in xrange(len_y)]

When calling dtw directly from an external module, yes, it is O(N^2). The fastdtw algorithm only reaches this line after reducing x and y to length <= radius + 1.
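To make the complexity point concrete, a toy check (numbers are illustrative): the quadratic comprehension only runs on the shrunken base-case series, so its size depends on the radius, not on the original N:

```python
# Illustrative: the default full window is quadratic in the lengths it is
# given, but fastdtw only falls back to it once the recursion has shrunk
# the series, so this cost is bounded by the radius, not by N.
radius = 1
len_x = len_y = radius + 1   # base-case length, independent of N
window = [(i, j) for i in range(len_x) for j in range(len_y)]
print(len(window))  # 4 cells, constant w.r.t. the original series length
```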

I don't know the theory, but if the alternative indexing approach still works (correctly), is much faster, and uses the same amount of space, I sure have nothing against it.

All right. I will adjust the pull request to indexing-approach after I adjust it for other items in this conversation.

kernc commented 8 years ago

I don't think @slaypni should mind. It's customary for Cython code to be a bit more verbose. Still keep it uniform and readable, of course. Perhaps add docstrings.

esquires commented 8 years ago

The latest reflects the indexing approach as well as the following elements of the above discussion:

To allow faster processing for p-norms (where I can loop over a numpy array), I had to change the signature of the Python function slightly so it can handle integer inputs for the dist argument.

The speedup depends on the inputs, as summarized below. For completeness I am also showing the results from running on the master branch of the fork.

| #dims | uses pnorm | radius | speedup (indexing-approach) | speedup (master) |
|-------|------------|--------|-----------------------------|------------------|
| 1     | True       | 1      | 94.6                        | 6.0              |
| 1     | True       | 10     | 335.8                       | 5.7              |
| 1     | False      | 1      | 16.5                        | 4.8              |
| 1     | False      | 10     | 20.9                        | 4.1              |
| 2     | True       | 1      | 130.0                       | 19.0             |
| 2     | True       | 10     | 172.7                       | 13.0             |
| 2     | False      | 1      | 2.2                         | 1.8              |
| 2     | False      | 10     | 2.3                         | 1.8              |

The results can be recreated by checking out (in the fork) either master or indexing-approach and running the following:

from fastdtw.fastdtw import fastdtw as fastdtw_p
from fastdtw._fastdtw import fastdtw as fastdtw_c
import numpy as np
from timeit import default_timer
from pandas import DataFrame
from pprint import pprint

n = 1000
r = 1

run2d = False

thetas = np.linspace(0, 2 * np.pi, n)
t = np.linspace(0, 10, n)
x = t + np.sin(thetas)
y = 2 * t + np.sin(thetas * 2)

xx = np.hstack((x[:, np.newaxis], x[:, np.newaxis]))
yy = np.hstack((y[:, np.newaxis], y[:, np.newaxis]))

def dist_2d(x, y):
    return np.linalg.norm(x - y)

def verify(a, b):
    print(abs(a[0] - b[0]))
    for i in range(len(a[1])):                                                
        if (a[1][i][0] != b[1][i][0] or a[1][i][1] != b[1][i][1]):            
            print('{}: {}, {}'.format(i, a[1][i], b[1][i])) 

df = DataFrame(columns=['#dims', 'uses pnorm', 'radius', 'speedup'],
               data=np.zeros((8, 4), dtype=object))

def run(x, y, r, dist):
    start = default_timer()
    a = fastdtw_c(x, y, r, dist)
    time_c = default_timer() - start  # elapsed seconds, cython version

    start = default_timer()
    b = fastdtw_p(x, y, r, dist)
    time_p = default_timer() - start  # elapsed seconds, pure-python version

    verify(a, b)
    return time_p / time_c

speedup = run(x, y, 1, None)
df.iloc[0] = [1, True, 1, '{:.1f}'.format(speedup)]

speedup = run(x, y, 10, None)
df.iloc[1] = [1, True, 10, '{:.1f}'.format(speedup)]

speedup = run(x, y, 1, lambda a, b: abs(a - b))
df.iloc[2] = [1, False, 1, '{:.1f}'.format(speedup)]

speedup = run(x, y, 10, lambda a, b: abs(a - b))
df.iloc[3] = [1, False, 10, '{:.1f}'.format(speedup)]

speedup = run(xx, yy, 1, 2)
df.iloc[4] = [2, True, 1, '{:.1f}'.format(speedup)]

speedup = run(xx, yy, 10, 2)
df.iloc[5] = [2, True, 10, '{:.1f}'.format(speedup)]

speedup = run(xx, yy, 1, lambda a, b: sum((a - b) ** 2) ** 0.5)
df.iloc[6] = [2, False, 1, '{:.1f}'.format(speedup)]

speedup = run(xx, yy, 10, lambda a, b: sum((a - b) ** 2) ** 0.5)
df.iloc[7] = [2, False, 10, '{:.1f}'.format(speedup)]

pprint(df)

esquires commented 8 years ago

Apparently one cannot reassign a pull request to a new branch, so I will close this one and open a new one that points to indexing-approach.

slaypni commented 7 years ago

The performance improvement looks great! I will take a look.