ENH: Allow using different sampling algorithms

stsievert / salmon

A tool to collect triplet queries

https://docs.stsievert.com/salmon/

BSD 3-Clause "New" or "Revised" License

ENH: Allow using different sampling algorithms #22

Closed stsievert closed 4 years ago

stsievert commented 4 years ago

What does this PR implement? It allows running different sampling algorithms. This might include random sampling or different adaptive algorithms.

TODO

[x] Test the implementation is working correctly. Make sure queries and answers are being received by the algorithm.
[x] integrate query scores
[x] better integration of algs.run.
[x] properly reset when new experiment initialized
[x] find clean way to communicate queries through Redis
[x] catch argument errors when initializing algorithms
~~[ ] integrate Dask~~

A good dummy for this might be a "round-robin" algorithm where the head is selected at random and the bottom items are selected randomly (for now).
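A minimal sketch of such a dummy sampler, not Salmon's actual class. The name `RoundRobinSketch`, its methods, and the `(head, left, right)` tuple are all hypothetical. It takes the conventional reading of "round-robin" and cycles the head through the items in order, which differs from the "selected at random" wording above; swap the marked line for a random draw to match that wording.

```python
import random
from typing import Any, List, Tuple


class RoundRobinSketch:
    """Hypothetical dummy sampler for triplet queries over items 0..n-1."""

    def __init__(self, n: int, seed: int = 0) -> None:
        if n < 3:
            raise ValueError("need at least 3 items to form a triplet")
        self.n = n
        self._next_head = 0
        self._rng = random.Random(seed)
        self.answers: List[Any] = []

    def get_query(self) -> Tuple[int, int, int]:
        # Assumption: the head cycles in order (replace with a random draw if desired).
        head = self._next_head
        self._next_head = (self._next_head + 1) % self.n
        # The two bottom items are drawn at random from the remaining items.
        others = [i for i in range(self.n) if i != head]
        left, right = self._rng.sample(others, 2)
        return head, left, right

    def process_answers(self, answers: List[Any]) -> None:
        # A dummy algorithm only records answers; it never updates a model.
        self.answers.extend(answers)
```

Because it keeps no model, a sampler like this is mainly useful for checking that queries and answers actually reach the algorithm.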

Future work:

integrate Dask (right now it's all running on a single thread)
better document/test setting parameters in adaptive algorithms

stsievert commented 4 years ago

This PR includes a Docker machine to run the different adaptive sampling algorithms. This backend has two endpoints: /init and /model. It specifically does not have endpoints for /get_query or /process_answer. That way, the serving of queries and the computation of queries are completely separated. As a consequence, errors on the backend are not caught, the system hangs until a query is computed, and the backend runs continuously.

stsievert commented 4 years ago

I've implemented a manager class/module to separate out the logic of serving queries and retrieving queries from the algorithms.

stsievert commented 4 years ago

> It specifically does not have endpoints for /get_query

Here's a design that works with the current implementation and also supports /get_query:

Have endpoints on the backend with computation happening in Dask.
By default, post queries to Redis with alg.get_queries. If the alg implements get_query, directly return the query.
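The dispatch rule in the two points above could look roughly like the sketch below. It is illustrative only: `QueryStore` is an in-memory stand-in for Redis, `serve_query` is a hypothetical helper, and the assumption that `get_queries` returns a `(queries, scores)` pair is mine, not something this PR specifies. Dask is left out entirely.

```python
import heapq
from typing import Any, List, Optional, Tuple


class QueryStore:
    """In-memory stand-in for Redis: the highest-scoring query is served first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._count = 0  # tie-breaker so queries themselves are never compared

    def post(self, queries: List[Any], scores: List[float]) -> None:
        for query, score in zip(queries, scores):
            # heapq is a min-heap, so negate the score to pop the largest first.
            heapq.heappush(self._heap, (-score, self._count, query))
            self._count += 1

    def pop(self) -> Optional[Any]:
        if not self._heap:
            return None
        _, _, query = heapq.heappop(self._heap)
        return query


def serve_query(alg: Any, store: QueryStore) -> Any:
    # If the algorithm implements get_query, return its query directly.
    if callable(getattr(alg, "get_query", None)):
        return alg.get_query()
    # Otherwise serve from the store, refilling it via get_queries when empty.
    query = store.pop()
    if query is None:
        queries, scores = alg.get_queries()  # assumed return shape
        store.post(queries, scores)
        query = store.pop()
    return query
```

In this sketch the web-facing code only ever calls `serve_query`, so whether an algorithm precomputes scored queries or answers on demand stays an internal detail of the algorithm.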

This implementation is more flexible, and the web client will communicate directly with the algorithm, which makes for cleaner code. All the computation will be handled with Dask, so I don't think we need to worry about overloading the server.

For example, what if the algorithm samples randomly and only needs to process answers? Define a get_query function and define run to only process answers as they're received. What if the query is returned by some model? Run the model with the context provided to get_query.
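As a rough illustration of the random-sampling case, here is a sketch in which `get_query` answers on demand and `run` only consumes answers. The class name, the dict layout of a query, the `context` parameter, and the `None` stop sentinel are all assumptions made for the example; a standard-library `queue.Queue` stands in for whatever channel delivers answers.

```python
import queue
import random
from typing import Any, Dict, List, Optional


class RandomSamplingSketch:
    """Hypothetical random sampler: queries on demand, answers handled in run()."""

    def __init__(self, n: int, seed: Optional[int] = None) -> None:
        if n < 3:
            raise ValueError("need at least 3 items to form a triplet")
        self.n = n
        self._rng = random.Random(seed)
        self.answers: List[Any] = []

    def get_query(self, context: Optional[Dict[str, Any]] = None) -> Dict[str, int]:
        # Random sampling ignores the context; a model-backed algorithm
        # would instead run its model on the context here.
        head, left, right = self._rng.sample(range(self.n), 3)
        return {"head": head, "left": left, "right": right}

    def run(self, incoming: "queue.Queue[Any]") -> None:
        # Process answers as they are received; never compute queries here.
        while True:
            answer = incoming.get()
            if answer is None:  # assumed sentinel that stops the loop
                break
            self.answers.append(answer)
```

A model-backed algorithm would keep the same two methods and differ only in what `get_query` does with `context`.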