Closed: esquires closed this 7 years ago
A 4x speed-up is negligible. With properly Cythonized code, you should be looking at a speed-up of one to three orders of magnitude. Can you use `cython --annotate` to identify any further parts that could benefit from optimization?
@esquires Thank you for writing the code! Although the performance improvement is beneficial, I feel it would be better to tweak some parts of the changes before merging them into the master branch. Things I noticed:
I feel it would be better not to make this module depend on Cython. The performance improvement would then be provided in an "opt-in" fashion: Cython would not be mandatory to install fastdtw, but it would accelerate this module if Cython were present in the environment. In that case, the original .py file would be kept mostly as it is, except for trying to import the Cythonized functions written by @esquires if they are available.
I would like to merge esquires:master into the dev branch, modify it there, and then merge it into the master branch. What do you think about this approach, @esquires?
@kernc The performance improvement of fastdtw.pyx over the original code was 2.5x in my environment. Although I have not confirmed it, the bottleneck may be the heavily accessed defaultdict, which cannot be accelerated by Cython. The defaultdict is used in order to guarantee O(N) memory complexity. In fastdtw 0.1.*, a numpy array was used instead of a defaultdict; although it did not provide O(N) memory complexity, it could be much faster.
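A toy sketch of the trade-off being described (illustrative only, not the fastdtw source; `window`, `sparse`, and `dense` are names invented here): a sparse mapping only materializes the cells inside the search window, so memory stays proportional to the window size, while a dense array always allocates all n*n cells.

```python
from collections import defaultdict
import numpy as np

n = 1000
# a narrow band around the diagonal, roughly 3 cells per row
window = [(i, j) for i in range(n) for j in (i - 1, i, i + 1) if 0 <= j < n]

sparse = defaultdict(lambda: float('inf'))
for i, j in window:
    sparse[(i, j)] = 0.0           # only window cells are materialized

dense = np.full((n, n), np.inf)    # dense alternative: always n*n floats

print(len(sparse), dense.size)     # ~3n entries vs n*n entries
```

The price of the O(N) footprint is a Python-level hash lookup on every access, which is exactly the part Cython cannot compile away.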
Indeed, if all yellow lines (fastdtw.html.zip) were made white, the execution could be much faster.
Cython can incorporate C++ types, and if targeting C++11+ is an option, `unordered_map` can be used as an almost drop-in dict replacement.
Then there are a couple of Cython compiler directives one should make sure to apply. In particular, `boundscheck=False`, `wraparound=False`, and `cdivision=True` should offer a significant speed-up.
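A minimal sketch of what that could look like (illustrative, not the actual patch; note that `std::unordered_map` with a `pair` key needs a user-supplied hash, so the ordered `std::map`, which accepts a pair key out of the box, is shown here instead):

```cython
# cython: boundscheck=False, wraparound=False, cdivision=True
# distutils: language = c++
from libcpp.map cimport map as cpp_map
from libcpp.pair cimport pair

ctypedef pair[int, int] coord

def fill_diagonal(int n):
    # A C++ map keyed by an (i, j) pair; accesses compile to C++ calls
    # instead of Python dict operations.
    cdef cpp_map[coord, double] D
    cdef int i
    for i in range(n):
        D[coord(i, i)] = 0.0
    return D.size()
```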
Thanks for the comments. If C++11 is allowed, that solution will work great; it will need to use `std::pair` as the key.
If not, one can build the window up row by row, keeping track of the indexes. Those indexes can then be used to index the "D" object (now a vector instead of a defaultdict). It runs about 75x faster after Cythonizing on my machine, but the code is a bit more complicated.
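The row-by-row indexing idea can be sketched in pure Python (a toy with an invented window, not the branch's code; it relies on each window row covering a contiguous run of columns, which holds for DTW windows):

```python
# Map each window cell (i, j) to a slot in a flat array by remembering,
# for each row, where that row starts in the flat layout and which
# column it begins at.
window = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

row_start = {}   # index into the flat array where row i begins
row_col0 = {}    # first column present in row i
flat_index = {}
for k, (i, j) in enumerate(window):
    if i not in row_start:
        row_start[i] = k
        row_col0[i] = j
    flat_index[(i, j)] = row_start[i] + (j - row_col0[i])

D = [0.0] * len(window)  # flat cost array replacing the defaultdict
assert window[flat_index[(1, 2)]] == (1, 2)
```

With the offsets precomputed, every (i, j) lookup becomes integer arithmetic into a flat array, which Cython compiles to plain C indexing.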
C++ pairs do seem to work.
All right. I will push something in the next week.
The latest push uses the STL (maps/sets), so it does not significantly alter the pure Python version. The speedup is ~5-6x from what I am seeing. There are other opportunities for speed (noted in the file), but when I tried them they didn't seem to give much more performance; perhaps they would get up to 10x in aggregate. I also didn't notice any speedup from the compiler directives. Let me know if I am missing something.
If code complexity is not a big issue I have pushed a branch in the fork called "indexing-approach". It basically uses arrays only instead of maps/sets and reaches 75x without compiler directives.
Let me know if I am missing something.
Indeed, I didn't stress this enough. The primary speed increase comes from reducing calls into the Python library and staying in pure C as much as possible. This means switching data structures for C++ structures as needed, but the mere switch is secondary to restructuring the code. What I mean by restructuring is roughly this. Compare:
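For instance, something in this spirit (an invented miniature, not the actual snippet from the pull request) is typical pure-Python DTW cell code:

```python
# A dict-of-tuples cost matrix updated with min() over freshly built
# tuples and a key function: every iteration constructs three tuples,
# calls a Python lambda, and goes through dict hashing.
D = {(0, 0): (0.0, 0, 0), (0, 1): (1.0, 0, 0), (1, 0): (2.0, 0, 0)}
dt = 0.5
i, j = 1, 1
D[i, j] = min((D[i-1, j][0] + dt, i-1, j),
              (D[i, j-1][0] + dt, i, j-1),
              (D[i-1, j-1][0] + dt, i-1, j-1),
              key=lambda a: a[0])
print(D[1, 1])   # the diagonal move wins: (0.5, 0, 0)
```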
The above compiles to C with lots of Py* calls, tuple construction, packing, and reference counting. All in all, not something you ever want in a tight loop. Sure, it could compile into a simple ternary (?:) operation, but apparently it does not. Better:
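For example, restructured along these lines (again an invented sketch with hypothetical flat-index arguments, not the actual patch):

```cython
# The same cell update restructured into typed locals and plain
# comparisons, which compiles to straight C: no tuples, no lambda,
# no Python calls.
cdef void update_cell(double[:] D, int[:] P, double dt,
                      int i_cell, int i_up, int i_left, int i_diag):
    cdef double best = D[i_diag] + dt
    cdef int prev = i_diag
    if D[i_up] + dt < best:
        best = D[i_up] + dt
        prev = i_up
    if D[i_left] + dt < best:
        best = D[i_left] + dt
        prev = i_left
    D[i_cell] = best      # cost and back-pointer kept in separate arrays
    P[i_cell] = prev
```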
A couple more Cython lines, but far simpler and much, much faster C++ code.
A primary goal for a solid Cython implementation is as few yellow lines as possible. Instead of profiling, run cython with the `--annotate` switch and inspect the resulting HTML file.
As far as I know, list comprehensions don't optimize into C well, so perhaps consider replacing them. Applying only a couple of such optimizations (the result is still mostly yellow), I already achieved another 3x speed-up (on top of your 5x).
> If code complexity is not a big issue I have pushed a branch in the fork called "indexing-approach". It basically uses arrays only instead of maps/sets and reaches 75x without compiler directives.
Is space complexity still O(N)?
> Is space complexity still O(N)?
Yes, sorry for the ambiguity. The code logic is not as simple, though.
Thanks for the feedback. I was a bit more aggressive with optimization this time around and got to about 7x in my environment. I think there wasn't a big performance boost because the dtw loop is still calling a Python dist function on two Python lists (apparent from the Cython annotation). If one can assume a numpy array for the data, this could probably be improved substantially.
Two other smaller areas that could be optimized are the window and the path. The former can be a vector whose size has an upper bound given by window_; during the loop where it is currently built as a list, a counter could track the actual size, followed by resizing to the appropriate size. The path can also be a vector, but it will need a wrapper to the dtw function since it is currently callable from an outside module.
Are any of these of interest?
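The preallocate-and-trim pattern for the window could look roughly like this in pure Python (the bound and loop contents are invented placeholders; in the Cython version the list would be a C++ vector resized down once at the end):

```python
upper_bound = 100                      # stand-in for the bound given by window_
window = [None] * upper_bound          # preallocate to the upper bound
count = 0
for i, j in [(0, 0), (0, 1), (1, 1)]:  # stand-in for the real window loop
    window[count] = (i, j)             # fill by index instead of append
    count += 1
window = window[:count]                # resize to the actual size
print(len(window))
```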
I don't know the theory, but if the alternative indexing approach still works (correctly), is much faster, and uses the same amount of space, I sure have nothing against it.
I see on the other branch:

```cython
cdef WindowElement *window = \
    <WindowElement *>PyMem_Malloc(len_x * len_y * sizeof(WindowElement))
for x_idx in range(len_x):
    for y_idx in range(len_y):
        ...
```
Sure this is O(N)?
> Sure this is O(N)?
>
> ```cython
> cdef WindowElement *window = \
>     <WindowElement *>PyMem_Malloc(len_x * len_y * sizeof(WindowElement))
> for x_idx in range(len_x):
>     for y_idx in range(len_y):
>         ...
> ```
Yes. In the master branch, see this line in dtw for the same implementation:

```python
if window is None:
    window = [(i, j) for i in xrange(len_x) for j in xrange(len_y)]
```
When calling dtw from some external module, yes, it is O(N^2). The fastdtw algorithm only uses this line after reducing `x` and `y` to length `<= radius + 1`.
> I don't know the theory, but if the alternative indexing approach still works (correctly), is much faster, and uses the same amount of space, I sure have nothing against it.
All right. I will adjust the pull request to the indexing-approach after I adjust it for the other items in this conversation.
I don't think @slaypni should mind. It's customary for Cython code to be a bit more verbose. Still, keep it uniform and readable, of course. Perhaps add docstrings.
The latest push reflects the indexing approach as well as the following elements of the above discussion:
To allow for faster processing of p-norms (where I can loop over a numpy array), I had to change the signature in the Python file slightly so it can handle integer inputs for the `dist` argument.
The speedup depends on the inputs, as summarized in the following table. For completeness I am also showing the results from running on the master of the fork repo.
#dims | uses pnorm | radius | speedup (indexing-approach) | speedup (master) |
---|---|---|---|---|
1 | True | 1 | 94.6 | 6.0 |
1 | True | 10 | 335.8 | 5.7 |
1 | False | 1 | 16.5 | 4.8 |
1 | False | 10 | 20.9 | 4.1 |
2 | True | 1 | 130.0 | 19.0 |
2 | True | 10 | 172.7 | 13.0 |
2 | False | 1 | 2.2 | 1.8 |
2 | False | 10 | 2.3 | 1.8 |
The results can be recreated by checking out (in the fork) either master or indexing-approach and running the following:
```python
from fastdtw.fastdtw import fastdtw as fastdtw_p
from fastdtw._fastdtw import fastdtw as fastdtw_c
import numpy as np
from timeit import default_timer
from pandas import DataFrame
from pprint import pprint

n = 1000
r = 1
run2d = False

thetas = np.linspace(0, 2 * np.pi, n)
t = np.linspace(0, 10, n)
x = t + np.sin(thetas)
y = 2 * t + np.sin(thetas * 2)
xx = np.hstack((x[:, np.newaxis], x[:, np.newaxis]))
yy = np.hstack((y[:, np.newaxis], y[:, np.newaxis]))


def dist_2d(x, y):
    return np.linalg.norm(x - y)


def verify(a, b):
    # the distance and path from the two implementations should agree
    print(abs(a[0] - b[0]))
    for i in range(len(a[1])):
        if a[1][i][0] != b[1][i][0] or a[1][i][1] != b[1][i][1]:
            print('{}: {}, {}'.format(i, a[1][i], b[1][i]))


df = DataFrame(columns=['#dims', 'uses pnorm', 'radius', 'speedup'],
               data=np.zeros((8, 4), dtype=object))


def run(x, y, r, dist):
    start = default_timer()
    a = fastdtw_c(x, y, r, dist)
    time_c = default_timer() - start
    start = default_timer()
    b = fastdtw_p(x, y, r, dist)
    time_p = default_timer() - start
    verify(a, b)
    return time_p / time_c


speedup = run(x, y, 1, None)
df.iloc[0] = [1, True, 1, '{:.1f}'.format(speedup)]
speedup = run(x, y, 10, None)
df.iloc[1] = [1, True, 10, '{:.1f}'.format(speedup)]
speedup = run(x, y, 1, lambda a, b: abs(a - b))
df.iloc[2] = [1, False, 1, '{:.1f}'.format(speedup)]
speedup = run(x, y, 10, lambda a, b: abs(a - b))
df.iloc[3] = [1, False, 10, '{:.1f}'.format(speedup)]
speedup = run(xx, yy, 1, 2)
df.iloc[4] = [2, True, 1, '{:.1f}'.format(speedup)]
speedup = run(xx, yy, 10, 2)
df.iloc[5] = [2, True, 10, '{:.1f}'.format(speedup)]
speedup = run(xx, yy, 1, lambda a, b: sum((a - b) ** 2) ** 0.5)
df.iloc[6] = [2, False, 1, '{:.1f}'.format(speedup)]
speedup = run(xx, yy, 10, lambda a, b: sum((a - b) ** 2) ** 0.5)
df.iloc[7] = [2, False, 10, '{:.1f}'.format(speedup)]
pprint(df)
```
Apparently one cannot assign a new branch to a pull request, so I will close this version and open a new one that points to indexing-approach.
The performance improvement looks great! I will take a look at it.
This also incorporates a new way of calculating the dtw function. The new calculation resulted in 2x runtime performance, with Cython providing another 2x for a total of 4x. The test data was
with the following conventions:
- `fastdtw_py` – the code in fastdtw/fastdtw_py.py
- `fastdtw_py2` – the pure Python method, but with the new dtw function (this code is not committed but could be regenerated by moving fastdtw.pyx to a pure Python module)
- `fastdtw` – the Cythonized code
The new calculation method is faster because it