tommedema closed this issue 1 year ago.
The datatype is fixed when the C code is compiled, so there is currently no easy way to change it at runtime. But it is a variable in our setup, so you can easily change it if you recompile yourself. I pushed a branch where all files are generated for the float C-type (thus np.float32): https://github.com/wannesm/dtaidistance/tree/feature/float
(warning: I did not test this extensively, just some quick tests with np.array([...], dtype=np.float32) )
ps: In principle we could generate multiple versions and combine them but this requires quite some disentanglement and extra code. And you are the first one to need this outside of our lab ;-)
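As a quick illustration of what the float branch expects: NumPy arrays default to float64 (C double), so data has to be created with, or cast to, np.float32 explicitly. A minimal sketch (the series values are made up):

```python
import numpy as np

# NumPy's default floating dtype is float64 (C "double")
series = np.array([0.0, 1.0, 2.0, 1.0])
assert series.dtype == np.float64

# The float build of the library expects float32, so cast explicitly
series32 = series.astype(np.float32)
assert series32.dtype == np.float32
```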
@wannesm this is great, I will try it out asap. Do you know how I can change the config myself and recompile so that I can still benefit from future updates to dtaidistance?
There are three manual steps:
1. Change double to float in lines 22-23 of dtaidistance/jinja/generate.py
2. Change double to float on line 19 of dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
3. Regenerate the files:
$ cd dtaidistance/jinja/
$ make
After this, compiling via the setup.py file (or pip) should result in a version that expects float instead of double.
Excellent. I'll try both your generated version and my own compiled version and report back! Appreciate the awesome support as always.
@wannesm ok so I tried your first solution:
pip install --force-reinstall --upgrade --no-deps --no-build-isolation --no-binary dtaidistance git+https://github.com/wannesm/dtaidistance.git@feature/float#egg=dtaidistance
on an M1 MacBook Pro, and it gave me this error:
https://gist.github.com/tommedema/806ea6b9deb1c10391dad30cf476c5d2
I then tried the second solution (compiling from source after changing the setup files) but got this error:
https://gist.github.com/tommedema/409522a058aa17fa50e944335ccf0663
Again, I really appreciate the help.
The makefile could get confused about which files to update after changing branches (and thus didn't regenerate some of the files, which triggers the error you got). It now regenerates all files by default. The master and feature/float branches are updated.
@wannesm sounds promising, though re-installing from git (the updated feature/float branch) still gave me this error, even after restarting my jupyter server and kernel:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File <string>:1
File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_41941/2997127518.py:117, in parallelTrainTestByQueryIndex(queryIndex)
File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_41941/2997127518.py:31, in getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount)
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
587 def kbest_matches_fast(self, k=1):
588 self.dists_options['use_c'] = True
--> 589 return self.kbest_matches(k=k)
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
591 def kbest_matches(self, k=1):
--> 592 self.align(k=k)
593 if k is None:
594 return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
563 max_dist = np.inf
564 for idx, series in enumerate(self.s):
--> 565 dist = dtw.distance(self.query, series, **self.dists_options)
566 if k is not None:
567 if len(h) < k:
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw.py:223, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
221 logger.warning("C-library not available, using the Python version")
222 else:
--> 223 return distance_fast(s1, s2, window,
224 max_dist=max_dist,
225 max_step=max_step,
226 max_length_diff=max_length_diff,
227 penalty=penalty,
228 psi=psi,
229 use_pruning=use_pruning,
230 only_ub=only_ub)
231 r, c = len(s1), len(s2)
232 if max_length_diff is not None and abs(r - c) > max_length_diff:
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw.py:340, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
338 s2 = util_numpy.verify_np_array(s2)
339 # Move data to C library
--> 340 d = dtw_cc.distance(s1, s2,
341 window=window,
342 max_dist=max_dist,
343 max_step=max_step,
344 max_length_diff=max_length_diff,
345 penalty=penalty,
346 psi=psi,
347 use_pruning=use_pruning,
348 only_ub=only_ub)
349 return d
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw_cc.pyx:289, in dtaidistance.dtw_cc.distance()
ValueError: Buffer dtype mismatch, expected 'float' but got 'double'
And, interestingly, building from source (after updating to your new master changes) gave me the opposite error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File <string>:1
File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_42677/2997127518.py:117, in parallelTrainTestByQueryIndex(queryIndex)
File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_42677/2997127518.py:31, in getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount)
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
587 def kbest_matches_fast(self, k=1):
588 self.dists_options['use_c'] = True
--> 589 return self.kbest_matches(k=k)
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
591 def kbest_matches(self, k=1):
--> 592 self.align(k=k)
593 if k is None:
594 return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
563 max_dist = np.inf
564 for idx, series in enumerate(self.s):
--> 565 dist = dtw.distance(self.query, series, **self.dists_options)
566 if k is not None:
567 if len(h) < k:
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw.py:223, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
221 logger.warning("C-library not available, using the Python version")
222 else:
--> 223 return distance_fast(s1, s2, window,
224 max_dist=max_dist,
225 max_step=max_step,
226 max_length_diff=max_length_diff,
227 penalty=penalty,
228 psi=psi,
229 use_pruning=use_pruning,
230 only_ub=only_ub)
231 r, c = len(s1), len(s2)
232 if max_length_diff is not None and abs(r - c) > max_length_diff:
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw.py:340, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
338 s2 = util_numpy.verify_np_array(s2)
339 # Move data to C library
--> 340 d = dtw_cc.distance(s1, s2,
341 window=window,
342 max_dist=max_dist,
343 max_step=max_step,
344 max_length_diff=max_length_diff,
345 penalty=penalty,
346 psi=psi,
347 use_pruning=use_pruning,
348 only_ub=only_ub)
349 return d
File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw_cc.pyx:289, in dtaidistance.dtw_cc.distance()
ValueError: Buffer dtype mismatch, expected 'double' but got 'float'
I checked that all my input queries and series are of type np.float32. I also ran pip uninstall dtaidistance before doing any of the above.
That's indeed surprising. I cannot reproduce it myself when starting from a clean installation from the feature/float branch and recompiling. For reference, this is what I do, starting from a new virtualenv and a git clone (to have access to the tests; I included one for subsequence search):
$ git clone https://github.com/wannesm/dtaidistance.git
$ cd dtaidistance
$ git checkout feature/float
$ pip install .
$ cd .. # to not use local repo as package
$ pip install pytest
$ python dtaidistance/tests/test_float.py
The first one is especially surprising. Maybe there is a transformation I forgot about. But it's odd that it's not triggered in the second version.
The second one feels like a wrong version of the compiled library is being picked up. It would help to print the mentioned line to see whether the toolbox is simply loading the wrong compiled library. The .pyx file should mention floats.
$ sed -n '289p' ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw_cc.pyx
def distance(float[:] s1, float[:] s2, **kwargs):
@wannesm I think you're right that it must be the data I'm passing in, as it does work when I do something like:
subsequence_search(q.astype(np.float32), series.astype(np.float32), dists_options={'use_c': True, 'max_dist': maxDistance})
It's odd given that I am sure I'm passing in float32, but clearly this must be on my side. Apologies for the back and forth and appreciate the help. I will figure out where the data is somehow not float32 next :)
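One common culprit when data that was cast to float32 arrives at the C library as float64 anyway: any operation that mixes the float32 data with a default-dtype (float64) array, or rebuilds an array from plain Python floats, silently upcasts. A small illustration of this generic NumPy behaviour (not specific to dtaidistance):

```python
import numpy as np

a32 = np.zeros(4, dtype=np.float32)
a64 = np.zeros(4)  # default dtype: float64

# Mixing float32 with a float64 array upcasts the result to float64
mixed = a32 + a64
assert mixed.dtype == np.float64

# Rebuilding an array from plain Python floats also loses float32
rebuilt = np.array([float(x) for x in a32])
assert rebuilt.dtype == np.float64
```

Checking .dtype right before the call that raises is usually the fastest way to find where the upcast happens.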
@wannesm btw - somehow after changing to float32, each core is still using 1.87GB at peak, whereas the actual data only seems to be about 60MB. Do you think this could be because of the matrix calculations subsequence_search is doing? I wonder if that expands the RAM usage.
Update: I think this is resolved by setting k = 200 (for example) instead of None when calling kbest_matches_fast
RAM usage has now gone from 150GB to 13GB with 80 cores. Thank you :)
Do you see memory usage go up during running subsequence search? This shouldn't happen too much.
Even without a window, the memory usage for the DTW computation is 2*len(array) (with a window it is 2*2*window).
The subsequence search keeps track of all distances for all segments instead of only the top-k, which is suboptimal. That is a memory usage of 8*nb_segments/1024**2 MiB, and thus it is still surprising that it is higher than the full data. In any case, I removed this array in the master branch as it is not strictly required (it now stores only k distances and indices).
ps: It's interesting to see cases like yours which push the limits with large datasets and many cores.
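The 2*len(array) figure above corresponds to the standard two-row dynamic-programming formulation of DTW, which keeps only the previous and current row of the cost matrix instead of the full len(s1) x len(s2) matrix. A rough pure-Python sketch of that idea (not the library's actual C implementation):

```python
import math

def dtw_two_rows(s1, s2):
    """DTW distance keeping only two DP rows: O(len(s2)) memory."""
    prev = [math.inf] * (len(s2) + 1)
    prev[0] = 0.0  # the warping path must start at (0, 0)
    for i in range(len(s1)):
        cur = [math.inf] * (len(s2) + 1)
        for j in range(len(s2)):
            cost = (s1[i] - s2[j]) ** 2
            cur[j + 1] = cost + min(prev[j], prev[j + 1], cur[j])
        prev = cur  # discard the old row: only two rows ever live
    return math.sqrt(prev[len(s2)])

# Identical series have distance 0
assert dtw_two_rows([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]) == 0.0
```

With a Sakoe-Chiba window, each row shrinks to roughly 2*window entries, giving the 2*2*window figure mentioned above.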
@wannesm yes, it's usually low, but then at the very end of returning the k best results it spikes to several gigabytes. This doesn't happen when k != None. I'm okay with setting k so this is not a big problem for me.
I pulled in master, changed the doubles to floats in the config files, compiled, and installed using pip, and it does work now. 🎉 I wonder whether the pip install method is more reliable than the setup script:
$ git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance
$ sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
$ cd dtaidistance/jinja && make && cd ../..
$ pip install .
Note that on Linux (GNU sed, as opposed to BSD sed on macOS) it would be sed -i 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
I'll keep k at an integer (not None) given your recommendation in the source code comments (very helpful, thank you).
FYI: the RAM usage is the same when k is None, even after the recent changes (not a problem for me, but I figured I'd share it in case it's interesting to you). When setting k to an integer, it stays at around 60-100 MB rather than 1.9 GB.
It's great to see peak RAM usage reduced from 150GB to just 13GB. It means I could expand my dataset further.
It is very interesting indeed! It's my first time working with this many cores and this much RAM. I have more things to work on, especially around relaxing my window requirements (looking into PSI next). Really appreciate the swift responses. It's pretty admirable to see your C code; I am limited to high-level languages currently (I never got into C but am curious about it).
I didn't anticipate large values for k being used. The increase in memory at the end is the creation of an array holding all results. But this array creates an expensive object for every match, which is unnecessary. I removed this. Querying results is now lazy and only creates an object if you ask for details on a match (and it can be garbage collected when you don't need it anymore).
That's perfect. Sounds like I can use k = None again? I'll give it a try!
I would still not advise it ;-) Setting k also avoids doing DTW computations, and those are the expensive part. A DTW computation is stopped early once it is clear that it cannot do better than the k-best distances up to that point. Setting k to None is the same as running all comparisons, which could be done faster using parallelization etc. (like for the distance_matrix computation).
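The k-best bookkeeping described above can be sketched with a bounded max-heap: once k candidates are collected, the worst of them gives an upper bound, and any DTW computation whose partial cost already exceeds that bound can be abandoned. A simplified plain-Python sketch (not the library's actual code):

```python
import heapq

def kbest_matches_sketch(distances, k):
    """Keep only the k smallest distances with their indices.

    `distances` stands in for per-segment DTW results; in the real
    search, the current worst of the k best would also be passed to
    the DTW routine (as max_dist) so it can stop early.
    """
    heap = []  # max-heap via negated distances
    for idx, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:  # better than the current worst kept
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-nd, idx) for nd, idx in heap)

# The two smallest of four candidate distances, with their indices
print(kbest_matches_sketch([5.0, 1.0, 3.0, 2.0], 2))  # [(1.0, 1), (2.0, 3)]
```

With k=None there is no bound, so every comparison must run to completion, which matches the behaviour described above.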
@wannesm I wanted to bring something to your attention regarding this. When reinstalling on another macbook (intel), I discovered the following behavior when using the setup.py script:
pip uninstall dtaidistance
export LDFLAGS="-L/usr/local/opt/libomp/lib"
export CPPFLAGS="-I/usr/local/opt/libomp/include"
git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance
sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
python3 setup.py build_ext --inplace
python3 setup.py install
Results in these double / float errors:
https://gist.github.com/tommedema/d2ab88161e6732a5029e0bfe5ce02485
When I then try dtw.try_import_c I get:
https://gist.github.com/tommedema/3c062c752df68567e0b15dcb9009ba92
However, when installing as before, it works fine:
pip uninstall dtaidistance
export LDFLAGS="-L/usr/local/opt/libomp/lib"
export CPPFLAGS="-I/usr/local/opt/libomp/include"
git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance
sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
cd dtaidistance/jinja && make && cd ../..
pip install .
Here I see no error messages, and the C version can be correctly imported.
Any idea what's causing this difference?
You are not running the make command in the jinja directory in the first version? This is not automatically triggered by the setup.py file because we didn't want to make jinja a required dependency. We add the generated files directly to the git repository.
The exact error you see is also because I further improved the independence of the exact datatype used throughout the code. Almost all occurrences of 'double' have been removed from the C and Cython code, and all types are now covered by a typedef of seq_t. This is done in dtaidistance_globals.pxd (which mirrors dd_globals from the C code). If you don't run the makefile, this is still set to 'double' for the Cython code (or add dtaidistance_globals.pxd to your sed command).
Ps: To make it easier, I added makefile rules to easily switch between types. You can now drop the sed command and simply do:
cd dtaidistance/jinja && make float && cd ../..
That worked! My bad for not running make with the setup script. The new make float rule is sweet, thanks for that :)
Do you recommend using the setup script or just pip install .? I'm not sure I understand the difference.
BTW, I did see this warning (repeated about 200x) when running make:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/_stdio.h:93:16: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
unsigned char *_base;
Full log at https://gist.github.com/tommedema/938ce28659a3257513fc0d819b0dd355
Not sure if it matters, since everything seems to work just fine.
I'm calculating DTW distances on a machine with 80 cores and 128GB of RAM. Unfortunately, my dataset doesn't fit in the 128GB of RAM because dtaidistance currently requires np.float64 (double) as the datatype. If I could change my datatype to np.float32, it would fit easily. However, casting down results in the following issue when passing the float32 series to subsequence search:
Would it be possible to resolve this? It would be huge for large datasets.
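For context on the expected saving: halving the bytes per value halves the array memory exactly, which NumPy confirms directly via nbytes:

```python
import numpy as np

n = 1_000_000
a64 = np.zeros(n)             # default float64: 8 bytes per value
a32 = a64.astype(np.float32)  # float32: 4 bytes per value

assert a64.nbytes == 8 * n
assert a32.nbytes == 4 * n    # exactly half the memory
```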