Odd "ValueError: buffer source array is read-only" error with dtaidistance inside ipyparallel

tommedema commented 2 years ago

I'm trying to compute the DTW distances between various queries and many other time series. For each query I run subsequence_search against an array of time series (i.e. an array of arrays):

def getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount):
    sa = subsequence_search(q, series, dists_options={'use_c': True})

    best = sa.kbest_matches_fast(k = maxMatchCount)

    return getSearchParameters(best, sa.distances, additions, minMatchCount)

Since I have many queries to compare against the time series, this is a good candidate for parallelization, so I set up a cluster with ipyparallel:

import ipyparallel as ipp

clusterProcessesCount = 4

cluster = ipp.Cluster(n = clusterProcessesCount)
cluster.start_cluster_sync()
rc = cluster.connect_client_sync()
rc.wait_for_engines(clusterProcessesCount)
lview = rc.load_balanced_view()
dview = rc[:]

It works fine when running arbitrary Python code:

tests = np.array([1, 2, 3])

def f(test):
    return test

for result in lview.imap(f, tests, ordered = False, max_outstanding = 'auto'):
    print(result)

However, as soon as I invoke sa.kbest_matches_fast inside one of the child processes, I get this exception:

exception calling callback for <AsyncResult(<ipyparallel.serialize.serialize.PrePickled object at 0x7f7e2a170f70>): failed>
Traceback (most recent call last):
  File "/Users/tommedema/opt/anaconda3/lib/python3.9/site-packages/ipyparallel/client/asyncresult.py", line 528, in _resolve_result
    raise r
ipyparallel.error.RemoteError: [0:apply] ValueError: buffer source array is read-only

The entire log can be found here

I could ask in the ipyparallel repo too, but the issue seems to occur only with dtaidistance when use_c is set to True, so I am wondering if you have an idea of what is going on here.

tommedema commented 2 years ago

I just found that when I disable C ("use_c": False) this error does not occur. However, that defeats the purpose of my performance optimization with parallelization.

If it helps, I can create a minimal reproducible example using Google Colab (though I'm not sure if I can run child processes there). Please let me know; I really appreciate any help on this.

wannesm commented 2 years ago

A minimal reproducible example would be helpful, yes. I cannot reproduce the error when using ipyparallel, and I cannot immediately see where a non-writable view would be created.

You could also check the inputs (query and series) passed to subsequence_search with the following test, to see whether the problem is triggered by an operation that runs before dtaidistance is called:

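# Fail loudly if either input arrives as a read-only NumPy array.
# (assert('message') always passes, because a non-empty string is truthy.)
assert query.flags.writeable, 'query is not writeable'
assert series.flags.writeable, 'series is not writeable'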
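
Since series is described earlier in the thread as an array of arrays, it may arrive as a plain Python list, which has no flags attribute. A minimal sketch of a check that handles both cases is shown here; the helper name readonly_inputs is hypothetical and not part of dtaidistance or ipyparallel:

```python
import numpy as np

def readonly_inputs(query, series):
    """Return labels of the inputs that are read-only NumPy arrays.

    Hypothetical helper for debugging; `series` may be a single
    ndarray or any iterable of ndarrays.
    """
    problems = []
    if isinstance(query, np.ndarray) and not query.flags.writeable:
        problems.append('query')
    if isinstance(series, np.ndarray):
        if not series.flags.writeable:
            problems.append('series')
    else:
        for i, s in enumerate(series):
            if isinstance(s, np.ndarray) and not s.flags.writeable:
                problems.append(f'series[{i}]')
    return problems

# Simulate inputs where the query and one series are read-only.
query = np.array([1.0, 2.0, 3.0])
query.setflags(write=False)
series = [np.array([0.0, 1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
series[1].setflags(write=False)

report = readonly_inputs(query, series)
print(report)
```

Calling it as the first statement inside the function that runs on the engine should show whether the arrays are already read-only when they arrive there.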

tommedema commented 2 years ago

@wannesm brilliant! That showed me that the input query is somehow not writable. I was able to resolve the issue by wrapping it with q = np.array(q). This is odd, since q was already a NumPy array (from np.take after pandas' to_numpy), but at least it works now.
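
The workaround works because np.array copies its input by default, and the fresh copy owns its data and is writeable. One plausible explanation for the read-only input, not confirmed in this thread, is that the array became read-only while being transferred to the engine. The sketch below only simulates a read-only array with setflags; the helper name ensure_writeable is hypothetical:

```python
import numpy as np

def ensure_writeable(arr):
    """Return arr unchanged if it is writeable, otherwise a writeable copy."""
    arr = np.asarray(arr)
    return arr if arr.flags.writeable else arr.copy()

# Simulate a query that arrives read-only.
q = np.arange(5, dtype=np.float64)
q.setflags(write=False)

# The fix from the thread: np.array() copies by default.
q_fixed = np.array(q)

# Alternative that copies only when needed.
q_checked = ensure_writeable(q)

print(q.flags.writeable, q_fixed.flags.writeable, q_checked.flags.writeable)
```

Copying only when the flag is off avoids duplicating large arrays that are already writeable.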

wannesm / dtaidistance

Odd "ValueError: buffer source array is read-only" error with dtaidistance inside ipyparallel #178