stsievert / salmon

A tool to collect triplet queries
https://docs.stsievert.com/salmon/
BSD 3-Clause "New" or "Revised" License

MAINT: run searches and updates in parallel #66

Closed · stsievert closed this 4 years ago

stsievert commented 4 years ago

**What does this PR implement?** It runs the query search and the model updates in parallel. Functionally, it implements this code:

```python
import itertools

while True:
    # launch the model update in the background
    future = client.submit(update)
    # search progressively more queries until the update finishes
    for pwr in itertools.count(start=10):
        queries = get_queries(num=2**pwr)
        if future.done():
            break
```
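
For concreteness, here's a runnable version of that loop with `dask.distributed`. The `slow_update` and `search_queries` functions below are illustrative stand-ins, not Salmon's API; only the submit-and-poll control flow matches the PR:

```python
# A runnable sketch of the search/update overlap, assuming dask.distributed.
# `slow_update` and `search_queries` are stand-ins for Salmon's real model
# update and query search; only the submit/poll control flow matches the PR.
import itertools
import time

from dask.distributed import Client


def slow_update(model):
    """Stand-in for the expensive model/embedding update."""
    time.sleep(1)
    return model + 1


def search_queries(model, num):
    """Stand-in for the query search over `num` candidates."""
    time.sleep(num * 1e-6)  # pretend scoring scales with the candidate count
    return num


if __name__ == "__main__":
    client = Client(processes=False)  # in-process scheduler, just for the demo
    model = 0
    for _ in range(3):  # a few rounds instead of `while True`
        future = client.submit(slow_update, model)
        for pwr in itertools.count(start=10):
            # keep widening the search until the background update lands
            searched = search_queries(model, num=2**pwr)
            if future.done():
                break
        model = future.result()  # swap in the freshly updated model
        print(f"model={model}, searched {searched} queries in the last pass")
    client.close()
```

The point of the pattern is that the search keeps doubling its candidate pool while the update runs in the background, so neither task blocks the other.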

This PR also does the following:

**Reference issues/PRs** This PR will close #61 and close #35.

stsievert commented 4 years ago

I debated using Dask or Ray actors for this. I chose not to because using an actor would mean the search is always using the current model, even if it's being updated concurrently. That means we'll have to serialize the model at least once. Luckily, that doesn't take too long: 84µs for 100k answers:

Timings w/ 100k answers:

```python
[ins] In [1]: import numpy as np
[ins] In [3]: n = 85
[ins] In [4]: posterior = np.random.uniform(size=(n, n)).astype("float32")
[ins] In [5]: answers = np.random.choice(n, size=(100_000, 3)).astype("uint16")
[ins] In [6]: embedding = np.random.uniform(size=(n, 3)).astype("float32")
[ins] In [7]: import pickle
[ins] In [8]: %timeit pickle.dumps((posterior, answers, embedding))
84.6 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
[ins] In [15]: len(pickle.dumps((posterior, answers, embedding))) / 1024
Out[15]: 615.4404296875  # 615 KB
```

If the number of answers is 20k, the serialization time drops to about 50µs. The full object is 615 KB (the 100k × 3 `uint16` answers array alone accounts for ~586 KB of that), so it's not huge.
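
To reproduce that measurement outside IPython (no `%timeit`), an equivalent plain script with the same shapes and dtypes as the session above:

```python
# Reproduces the pickle timing above as a plain script (same shapes/dtypes).
import pickle
import timeit

import numpy as np

n = 85
posterior = np.random.uniform(size=(n, n)).astype("float32")
answers = np.random.choice(n, size=(100_000, 3)).astype("uint16")
embedding = np.random.uniform(size=(n, 3)).astype("float32")
payload = (posterior, answers, embedding)

n_loops = 10_000
secs = timeit.timeit(lambda: pickle.dumps(payload), number=n_loops) / n_loops
print(f"pickle.dumps: {secs * 1e6:.1f} µs per call")
print(f"serialized size: {len(pickle.dumps(payload)) / 1024:.1f} KB")
```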

stsievert commented 4 years ago

From https://github.com/stsievert/salmon/issues/35#issuecomment-621483808:

This still provides all the features in https://github.com/stsievert/salmon/issues/35#issuecomment-668657038.