wannesm / dtaidistance

Time series distances: Dynamic Time Warping (fast DTW implementation in C)

ValueError: Buffer dtype mismatch, expected 'double' but got 'float' #183

Closed tommedema closed 1 year ago

tommedema commented 1 year ago

I'm calculating DTW distances on a machine with 80 cores and 128GB of RAM. Unfortunately my dataset doesn't fit in the 128GB of RAM due to dtaidistance currently requiring np.float64 (double) as a datatype. If I could change my datatype to np.float32, it would fit easily. Unfortunately casting down results in the following issue when passing the float32 series to subsequence search:

File /usr/local/lib/python3.10/dist-packages/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
    587 def kbest_matches_fast(self, k=1):
    588     self.dists_options['use_c'] = True
--> 589     return self.kbest_matches(k=k)

File /usr/local/lib/python3.10/dist-packages/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
    591 def kbest_matches(self, k=1):
--> 592     self.align(k=k)
    593     if k is None:
    594         return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]

File /usr/local/lib/python3.10/dist-packages/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
    563 max_dist = np.inf
    564 for idx, series in enumerate(self.s):
--> 565     dist = dtw.distance(self.query, series, **self.dists_options)
    566     if k is not None:
    567         if len(h) < k:

File /usr/local/lib/python3.10/dist-packages/dtaidistance/dtw.py:178, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
    176         logger.warning("C-library not available, using the Python version")
    177     else:
--> 178         return distance_fast(s1, s2, window,
    179                              max_dist=max_dist,
    180                              max_step=max_step,
    181                              max_length_diff=max_length_diff,
    182                              penalty=penalty,
    183                              psi=psi,
    184                              use_pruning=use_pruning,
    185                              only_ub=only_ub)
    186 r, c = len(s1), len(s2)
    187 if max_length_diff is not None and abs(r - c) > max_length_diff:

File /usr/local/lib/python3.10/dist-packages/dtaidistance/dtw.py:295, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
    293 s2 = util_numpy.verify_np_array(s2)
    294 # Move data to C library
--> 295 d = dtw_cc.distance(s1, s2,
    296                     window=window,
    297                     max_dist=max_dist,
    298                     max_step=max_step,
    299                     max_length_diff=max_length_diff,
    300                     penalty=penalty,
    301                     psi=psi,
    302                     use_pruning=use_pruning,
    303                     only_ub=only_ub)
    304 return d

File /usr/local/lib/python3.10/dist-packages/dtaidistance/dtw_cc.pyx:281, in dtaidistance.dtw_cc.distance()

ValueError: Buffer dtype mismatch, expected 'double' but got 'float'

Would it be possible to resolve this? That would be huge for large datasets
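To put the memory argument in numbers, here is a minimal sketch of the arithmetic. The value count is a made-up figure for illustration, not the size of the actual dataset; the only hard facts used are that a C double takes 8 bytes and a C float takes 4.

```python
import struct

# Standard sizes of the two C types involved (8 and 4 bytes).
BYTES_FLOAT64 = struct.calcsize("<d")
BYTES_FLOAT32 = struct.calcsize("<f")


def footprint_gib(n_values, bytes_per_value):
    """Raw size in GiB of n_values numbers stored back to back."""
    return n_values * bytes_per_value / 1024**3


# Hypothetical dataset size, chosen only to make the numbers round.
n_values = 20 * 1024**3

print(f"float64: {footprint_gib(n_values, BYTES_FLOAT64):.0f} GiB")
print(f"float32: {footprint_gib(n_values, BYTES_FLOAT32):.0f} GiB")
```

Halving the element size halves the raw footprint, which is why a dataset that overflows 128GB as float64 can fit as float32.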

wannesm commented 1 year ago

The datatype is used to compile the C code. So there is currently no easy way to change the datatype. But this is a variable in our setup, so we can easily change it if you recompile yourself. I pushed a branch where all files are generated for the float C-type (thus np.float32): https://github.com/wannesm/dtaidistance/tree/feature/float

(warning: I did not test this extensively, just some quick tests with np.array([...], dtype=np.float32) )

ps: In principle we could generate multiple versions and combine them but this requires quite some disentanglement and extra code. And you are the first one to need this outside of our lab ;-)

tommedema commented 1 year ago

@wannesm this is great, I will try it out asap. Do you know how I can change the config myself and recompile so that I can still benefit from future updates to dtaidistance?

wannesm commented 1 year ago

There are three manual steps:

  1. Change double to float in lines 22-23 in dtaidistance/jinja/generate.py
    https://github.com/wannesm/dtaidistance/blob/2a7744574a1e42a019c9c63a026911c1a3f6e7fb/dtaidistance/jinja/generate.py#L22-L23
  2. Change double to float in line 19 in dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
    https://github.com/wannesm/dtaidistance/blob/2a7744574a1e42a019c9c63a026911c1a3f6e7fb/dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h#L19
  3. Regenerate the source code files with:
    cd dtaidistance/jinja/
    make

After this compiling via the setup.py file (or pip) should result in a version that expects float instead of double.

tommedema commented 1 year ago

Excellent. I'll try both your generated version and my own compiled version and report back! Appreciate the awesome support as always.

tommedema commented 1 year ago

@wannesm ok so I tried your first solution:

pip install --force-reinstall --upgrade --no-deps --no-build-isolation --no-binary dtaidistance git+https://github.com/wannesm/dtaidistance.git@feature/float#egg=dtaidistance

on a m1 macbook pro, and it gave me this error:

https://gist.github.com/tommedema/806ea6b9deb1c10391dad30cf476c5d2

I then tried the second solution (compiling from source after changing the setup files) but got this error:

https://gist.github.com/tommedema/409522a058aa17fa50e944335ccf0663

Again, I really appreciate the help.

wannesm commented 1 year ago

The makefile could get confused about which files to update after changing branches (and thus didn't change some of the files which triggers the error you got). It now regenerates all files by default. The master and feature/float branches are updated.

tommedema commented 1 year ago

@wannesm sounds promising, though re-installing from git (the updated feature/float branch) still gave me this error, even after restarting my jupyter server and kernel:

[8:apply]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <string>:1

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_41941/2997127518.py:117, in parallelTrainTestByQueryIndex(queryIndex)

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_41941/2997127518.py:31, in getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
    587 def kbest_matches_fast(self, k=1):
    588     self.dists_options['use_c'] = True
--> 589     return self.kbest_matches(k=k)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
    591 def kbest_matches(self, k=1):
--> 592     self.align(k=k)
    593     if k is None:
    594         return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
    563 max_dist = np.inf
    564 for idx, series in enumerate(self.s):
--> 565     dist = dtw.distance(self.query, series, **self.dists_options)
    566     if k is not None:
    567         if len(h) < k:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw.py:223, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
    221         logger.warning("C-library not available, using the Python version")
    222     else:
--> 223         return distance_fast(s1, s2, window,
    224                              max_dist=max_dist,
    225                              max_step=max_step,
    226                              max_length_diff=max_length_diff,
    227                              penalty=penalty,
    228                              psi=psi,
    229                              use_pruning=use_pruning,
    230                              only_ub=only_ub)
    231 r, c = len(s1), len(s2)
    232 if max_length_diff is not None and abs(r - c) > max_length_diff:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw.py:340, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
    338 s2 = util_numpy.verify_np_array(s2)
    339 # Move data to C library
--> 340 d = dtw_cc.distance(s1, s2,
    341                     window=window,
    342                     max_dist=max_dist,
    343                     max_step=max_step,
    344                     max_length_diff=max_length_diff,
    345                     penalty=penalty,
    346                     psi=psi,
    347                     use_pruning=use_pruning,
    348                     only_ub=only_ub)
    349 return d

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw_cc.pyx:289, in dtaidistance.dtw_cc.distance()

ValueError: Buffer dtype mismatch, expected 'float' but got 'double'

And, interestingly, building from source (after updating to your new master changes) gave me the opposite error:

[7:apply]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <string>:1

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_42677/2997127518.py:117, in parallelTrainTestByQueryIndex(queryIndex)

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_42677/2997127518.py:31, in getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
    587 def kbest_matches_fast(self, k=1):
    588     self.dists_options['use_c'] = True
--> 589     return self.kbest_matches(k=k)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
    591 def kbest_matches(self, k=1):
--> 592     self.align(k=k)
    593     if k is None:
    594         return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
    563 max_dist = np.inf
    564 for idx, series in enumerate(self.s):
--> 565     dist = dtw.distance(self.query, series, **self.dists_options)
    566     if k is not None:
    567         if len(h) < k:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw.py:223, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
    221         logger.warning("C-library not available, using the Python version")
    222     else:
--> 223         return distance_fast(s1, s2, window,
    224                              max_dist=max_dist,
    225                              max_step=max_step,
    226                              max_length_diff=max_length_diff,
    227                              penalty=penalty,
    228                              psi=psi,
    229                              use_pruning=use_pruning,
    230                              only_ub=only_ub)
    231 r, c = len(s1), len(s2)
    232 if max_length_diff is not None and abs(r - c) > max_length_diff:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw.py:340, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
    338 s2 = util_numpy.verify_np_array(s2)
    339 # Move data to C library
--> 340 d = dtw_cc.distance(s1, s2,
    341                     window=window,
    342                     max_dist=max_dist,
    343                     max_step=max_step,
    344                     max_length_diff=max_length_diff,
    345                     penalty=penalty,
    346                     psi=psi,
    347                     use_pruning=use_pruning,
    348                     only_ub=only_ub)
    349 return d

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw_cc.pyx:289, in dtaidistance.dtw_cc.distance()

ValueError: Buffer dtype mismatch, expected 'double' but got 'float'

I checked that all my input queries and series are of type np.float32. I also ran pip uninstall dtaidistance before doing any of the above.
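The two tracebacks above point at different install locations (a plain site-packages directory and an .egg directory), so one thing worth ruling out is a stale copy shadowing the fresh build. A minimal sketch for checking which file Python resolves for a package; module_origin is a helper name made up for this example:

```python
import importlib.util


def module_origin(name):
    """Return the file Python would load for a top-level module, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints the path of the dtaidistance copy that would be imported,
# or None if no copy is installed in this environment.
print(module_origin("dtaidistance"))
```

If the printed path is not the install that was just built, the old copy still needs to be removed.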

wannesm commented 1 year ago

That's indeed surprising. I cannot get it reproduced myself when starting from a clean installation from the feature/float branch and recompiling. Just for reference what I do, from a new virtualenv and git clone (to have access to tests, I included one with subseq search):

$ git clone https://github.com/wannesm/dtaidistance.git
$ cd dtaidistance
$ git checkout feature/float
$ pip install .
$ cd .. # to not use local repo as package
$ pip install pytest
$ python dtaidistance/tests/test_float.py

The first one is especially surprising. Maybe there is a transformation I forgot about. But it's surprising it's not triggered in the second version.

The second one feels like a wrong version of the compiled library is picked up. It would help to print the mentioned line to see whether the toolbox is simply picking up the wrong compiled library. The .pyx file should mention floats.

$ sed -n '289p' ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw_cc.pyx
def distance(float[:] s1, float[:] s2, **kwargs):

tommedema commented 1 year ago

@wannesm I think you're right that it must be the data I'm passing in, as it does work when I do something like:

subsequence_search(q.astype(np.float32), series.astype(np.float32), dists_options={'use_c': True, 'max_dist': maxDistance})

It's odd given that I am sure I'm passing in float32, but clearly this must be on my side. Apologies for the back and forth and appreciate the help. I will figure out where the data is somehow not float32 next :)
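For tracking down where a stray dtype comes from, a small check along these lines may help. It is only a sketch: find_wrong_dtype is a hypothetical helper, and it relies on NumPy dtypes printing as names such as 'float32' or 'float64'.

```python
def find_wrong_dtype(arrays, expected="float32"):
    """Return (index, dtype name) for every item whose dtype differs from `expected`.

    Items without a .dtype attribute (e.g. plain Python lists) are
    reported under their Python type name instead.
    """
    wrong = []
    for i, arr in enumerate(arrays):
        name = str(getattr(arr, "dtype", type(arr).__name__))
        if name != expected:
            wrong.append((i, name))
    return wrong
```

Calling find_wrong_dtype([q, *series]) right before the search would list every offending input. One common source of surprises is arithmetic between a float32 array and a float64 array, which NumPy promotes to float64.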

tommedema commented 1 year ago

@wannesm btw - somehow after changing to float32 each core is still using 1.87GB at peak, where the actual data only seems to be about 60MB. Do you think this could be because of the matrix calculations subsequence_search is doing? I wonder if that expands the RAM usage.

Update: I think this is resolved by setting k = 200 (for example) instead of None when calling kbest_matches_fast

RAM usage has now gone from 150GB to 13GB with 80 cores. Thank you :)

wannesm commented 1 year ago

Do you see memory usage go up during running subsequence search? This shouldn't happen too much.

Even without a window, the memory usage for the DTW computation is 2*len(array) (with window it is 2*2*window). The subsequencesearch is keeping track of all distances for all segments instead of the top-k which is suboptimal. That is a memory usage of 8*nb_segments/1024**2 MiB, and thus still surprising to be higher than the full data. In any case, I removed this array in the master branch as it is not strictly required (it now stores only k distances+indices).
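The formulas above can be written out as a quick back-of-the-envelope calculation. This is only a sketch: the function names are invented here, and the assumption that the DTW buffer is counted in values (not bytes) is mine.

```python
def distances_array_mib(nb_segments, bytes_per_value=8):
    """Size in MiB of an array holding one distance per segment (8 bytes each for double)."""
    return bytes_per_value * nb_segments / 1024**2


def dtw_buffer_values(series_len, window=None):
    """Number of values kept for one DTW computation: 2*len without a window, 2*2*window with one."""
    return 2 * series_len if window is None else 2 * 2 * window


# Hypothetical sizes, for illustration only.
print(distances_array_mib(1_000_000))       # roughly 7.6 MiB for a million segments
print(dtw_buffer_values(10_000))            # 20000 values without a window
print(dtw_buffer_values(10_000, window=50)) # 200 values with a window of 50
```

Even a million segments only account for a few MiB, which is why a peak of 1.87GB per core points at something other than this array.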

ps: It's interesting to see cases like yours which push the limits with large datasets and many cores.

tommedema commented 1 year ago

@wannesm yes, it's usually low, but then at the very end of returning the k best results it spikes to several gigabytes. This doesn't happen when k != None. I'm okay with setting k so this is not a big problem for me.

I pulled in master, changed the doubles to floats in the config files, compiled, and installed using pip, and it works now. 🎉 I wonder if the pip install method is more reliable than the setup script:

$ git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance
$ sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
$ cd dtaidistance/jinja && make && cd ../..
$ pip install .

Note that on Linux (GNU sed, unlike the BSD sed on macOS) it would be sed -i 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h

I'll keep k at an integer (not None) given your recommendation in the source code comments (very helpful, thank you).

FYI- the RAM usage is the same when K is at None, even after recent changes (not a problem for me but figured I'd share in case this is interesting to you):

[Screenshot: Screen Shot 2022-10-10 at 6 31 34 PM]

When setting K it stays at around 60-100MB rather than 1.9GB.

It's great to see peak RAM usage reduced from 150GB to just 13GB. Means I could expand my data set further..

It is very interesting indeed! It's my first time working with this many cores and RAM. I have more things to work on especially around relaxing my window requirements (looking into PSI next). Really appreciate the swift responses. It's pretty admirable to see your C code, I am limited to high level languages currently (never got into C but have curiosity towards it).

wannesm commented 1 year ago

I didn't anticipate using large values for k. The increase of the memory at the end is the creation of an array to hold all results. But this array creates an expensive object for every match, which is unnecessary. I removed this. Querying results is now lazy and only creates an object if you ask for details on a match (and can garbage collect if you don't need it anymore).
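The idea of lazy results can be illustrated with a minimal sketch. The classes below are hypothetical and are not the ones used in dtaidistance; they only show the pattern of storing cheap numbers and building the heavier object on access.

```python
class Match:
    """Hypothetical per-match object; `created` counts how many were built."""

    created = 0

    def __init__(self, idx, distance):
        Match.created += 1
        self.idx = idx
        self.distance = distance


class LazyMatches:
    """Keeps only the raw distances and builds a Match when one is requested."""

    def __init__(self, distances):
        self._distances = distances

    def __len__(self):
        return len(self._distances)

    def __getitem__(self, i):
        # The object is created here, on demand, and can be garbage
        # collected as soon as the caller drops its reference.
        return Match(i, self._distances[i])
```

With this layout, holding a million results costs one number per result, and a Match only exists while someone is looking at it.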

tommedema commented 1 year ago

> I didn't anticipate using large values for k. The increase of the memory at the end is the creation of an array to hold all results. But this array creates an expensive object for every match, which is unnecessary. I removed this. Querying results is now lazy and only creates an object if you ask for details on a match (and can garbage collect if you don't need it anymore).

That's perfect. Sounds like I can use k = None again? I'll give it a try!

wannesm commented 1 year ago

I would still not advise it ;-) Setting k also avoids DTW computations, and those are the expensive part. A DTW computation is stopped early once it is clear that its result cannot beat the k-th best distance found so far. Setting k to None is the same as running all comparisons, which could be done faster using parallelization etc. (like for the distance_matrix computation).
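The effect of k on the amount of work can be sketched as follows. This is not the dtaidistance implementation: the distance below is a plain Euclidean distance with early abandoning, used as a stand-in for DTW, and both function names are invented for the example. What it shares with the description above is the pattern of passing the current k-th best distance as a cut-off.

```python
import heapq
import math


def early_abandon_distance(a, b, max_dist=math.inf):
    """Euclidean stand-in for DTW that returns inf once the running sum exceeds max_dist."""
    limit = max_dist * max_dist
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > limit:
            return math.inf  # cannot beat the current k-th best anymore
    return math.sqrt(total)


def kbest(query, candidates, k):
    """Return the k closest candidates as sorted (distance, index) pairs."""
    heap = []            # max-heap via negated distances: (-distance, index)
    max_dist = math.inf  # no cut-off until k results are known
    for idx, series in enumerate(candidates):
        d = early_abandon_distance(query, series, max_dist)
        if math.isinf(d):
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        else:
            heapq.heappushpop(heap, (-d, idx))  # drop the current worst
        if len(heap) == k:
            max_dist = -heap[0][0]  # k-th best distance so far
    return sorted((-neg_d, idx) for neg_d, idx in heap)
```

The smaller k is, the tighter the cut-off becomes and the earlier most comparisons stop. With k = None there is no cut-off at all, so every comparison runs to completion.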

tommedema commented 1 year ago

@wannesm I wanted to bring something to your attention regarding this. When reinstalling on another macbook (intel), I discovered the following behavior when using the setup.py script:

pip uninstall dtaidistance

export LDFLAGS="-L/usr/local/opt/libomp/lib"
export CPPFLAGS="-I/usr/local/opt/libomp/include"

git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance

sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h

python3 setup.py build_ext --inplace

python3 setup.py install

Results in these double / float errors:

https://gist.github.com/tommedema/d2ab88161e6732a5029e0bfe5ce02485

When I then try dtw.try_import_c I get:

https://gist.github.com/tommedema/3c062c752df68567e0b15dcb9009ba92

However, when installing as before, it works fine:

pip uninstall dtaidistance

export LDFLAGS="-L/usr/local/opt/libomp/lib"
export CPPFLAGS="-I/usr/local/opt/libomp/include"

git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance

sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h

cd dtaidistance/jinja && make && cd ../..
pip install .

Here I see no error messages, and the C version can be correctly imported.

Any idea what's causing this difference?

wannesm commented 1 year ago

You are not running the make command in the jinja directory in the first version? This is not automatically triggered by the setup.py file because we didn't want to make jinja a required dependency. We add the generated files directly to the git repository.

The exact error you see is also because I further improved the independence of the exact datatype used throughout the code. Almost all occurrences of 'double' have been removed from the C and Cython code and all types are now covered by a typedef of seq_t. This is done in dtaidistance_globals.pxd (which mirrors dd_globals from the C code). If you don't run the makefile, this is still set to 'double' for the Cython code (alternatively, add dtaidistance_globals.pxd to your sed command).
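A mismatch between the two definitions could be detected with a small check like the one below. It is a sketch under an assumption: that the C header contains a line of the form "typedef double seq_t;" and the .pxd file a line of the form "ctypedef double seq_t". The exact contents of those files are not shown in this thread, so the pattern may need adjusting.

```python
import re

# Matches both "typedef <type> seq_t" (C) and "ctypedef <type> seq_t" (Cython).
_SEQ_T = re.compile(r"\bc?typedef\s+(\w+)\s+seq_t\b")


def seq_t_type(source_text):
    """Return the type that seq_t is defined as in the given source text, or None."""
    match = _SEQ_T.search(source_text)
    return match.group(1) if match else None


def seq_t_agrees(header_text, pxd_text):
    """True if both sources define seq_t and define it as the same type."""
    c_type = seq_t_type(header_text)
    return c_type is not None and c_type == seq_t_type(pxd_text)
```

Reading dd_globals.h and dtaidistance_globals.pxd into strings and passing them to seq_t_agrees before compiling would flag the half-converted state that produces the double/float errors.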

Ps: To make it easier, I added makefile rules to easily switch between types. You can now drop the sed command and simply do:

cd dtaidistance/jinja && make float && cd ../..

tommedema commented 1 year ago

That worked! My bad for not running make with the setup script. The new make float is sweet, thanks for that :)

Do you recommend using the setup script or just using pip install .? I'm not sure I understand the difference.

BTW- I did see this warning (repeated about 200x) when running make:

/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/_stdio.h:93:16: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        unsigned char   *_base;

Full log at https://gist.github.com/tommedema/938ce28659a3257513fc0d819b0dd355

Not sure if it matters, since everything seems to work just fine.