stsievert / salmon

A tool to collect triplet queries
https://docs.stsievert.com/salmon/
BSD 3-Clause "New" or "Revised" License

MAINT: run searches and updates in parallel #66

Closed · stsievert closed this 4 years ago

stsievert commented 4 years ago

**What does this PR implement?** It runs the query search and the model updates in parallel. Functionally, it implements this code:

```python
import itertools

while True:
    # launch the model update in the background
    future = client.submit(update)
    # search progressively more queries until the update finishes
    for pwr in itertools.count(start=10):
        queries = get_queries(num=2**pwr)
        if future.done():
            break
```
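
For concreteness, here's a runnable version of that loop with `dask.distributed`. The `slow_update` and `search_queries` functions below are illustrative stand-ins, not Salmon's API; only the submit-and-poll control flow matches the PR:

```python
# A runnable sketch of the search/update overlap, assuming dask.distributed.
# `slow_update` and `search_queries` are stand-ins for Salmon's real model
# update and query search; only the submit/poll control flow matches the PR.
import itertools
import time

from dask.distributed import Client


def slow_update(model):
    """Stand-in for the expensive model/embedding update."""
    time.sleep(1)
    return model + 1


def search_queries(model, num):
    """Stand-in for the query search over `num` candidates."""
    time.sleep(num * 1e-6)  # pretend scoring scales with the candidate count
    return num


if __name__ == "__main__":
    client = Client(processes=False)  # in-process scheduler, just for the demo
    model = 0
    for _ in range(3):  # a few rounds instead of `while True`
        future = client.submit(slow_update, model)
        for pwr in itertools.count(start=10):
            # keep widening the search until the background update lands
            searched = search_queries(model, num=2**pwr)
            if future.done():
                break
        model = future.result()  # swap in the freshly updated model
        print(f"model={model}, searched {searched} queries in the last pass")
    client.close()
```

The point of the pattern is that the search keeps doubling its candidate pool while the update runs in the background, so neither task blocks the other.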

This PR also does the following:

**Reference issues/PRs** This PR will close #61 and close #35.

stsievert commented 4 years ago

I debated using Dask or Ray actors for this. I chose not to because using an actor would mean the search is always using the current model, even if it's being updated concurrently. That means we'll have to serialize the model at least once. Luckily, that doesn't take too long: 84µs for 100k answers:

Timings w/ 100k answers:

```python
[ins] In [1]: import numpy as np
[ins] In [3]: n = 85
[ins] In [4]: posterior = np.random.uniform(size=(n, n)).astype("float32")
[ins] In [5]: answers = np.random.choice(n, size=(100_000, 3)).astype("uint16")
[ins] In [6]: embedding = np.random.uniform(size=(n, 3)).astype("float32")
[ins] In [7]: import pickle
[ins] In [8]: %timeit pickle.dumps((posterior, answers, embedding))
84.6 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
[ins] In [15]: len(pickle.dumps((posterior, answers, embedding))) / 1024
Out[15]: 615.4404296875  # 615 KB
```

If the number of answers is 20k, the serialization time drops to about 50µs. The full object is 615 KB (the 100k × 3 `uint16` answers array alone accounts for ~586 KB of that), so it's not huge.
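
To reproduce that measurement outside IPython (no `%timeit`), an equivalent plain script with the same shapes and dtypes as the session above:

```python
# Reproduces the pickle timing above as a plain script (same shapes/dtypes).
import pickle
import timeit

import numpy as np

n = 85
posterior = np.random.uniform(size=(n, n)).astype("float32")
answers = np.random.choice(n, size=(100_000, 3)).astype("uint16")
embedding = np.random.uniform(size=(n, 3)).astype("float32")
payload = (posterior, answers, embedding)

n_loops = 10_000
secs = timeit.timeit(lambda: pickle.dumps(payload), number=n_loops) / n_loops
print(f"pickle.dumps: {secs * 1e6:.1f} µs per call")
print(f"serialized size: {len(pickle.dumps(payload)) / 1024:.1f} KB")
```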

stsievert commented 4 years ago

From https://github.com/stsievert/salmon/issues/35#issuecomment-621483808:

This still provides all the features in https://github.com/stsievert/salmon/issues/35#issuecomment-668657038.