Motivation
Support multi-customer and multi-product.
In Scope
Out Of Scope
Proposed High-Level Design
Taking a step back for a moment: in RL, agents act in an environment. Roughly speaking, we can split our design problem into two components: simulation and algorithm.

Simulation

Agents (which have an action space) and entities (which do not) interact in an environment. I think it makes sense to begin by describing the state and action spaces.
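A minimal sketch of what State could look like, assuming a pricing-style domain; the fields are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass


@dataclass
class State:
    # Illustrative fields only: these are assumptions about the domain.
    price: float      # price currently set by the agent
    inventory: int    # units of product on hand
    demand: float     # most recently observed demand
```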
We can similarly define Action. Dynamics would map actions to state updates. Agents would then hold instances of State and Action, which are used to create an instance of Dynamics. Agents would also hold an instance of Reward, which we construct recursively using the decorator pattern; the same goes for Observation. This design enables multi-customer as well as multi-product setups, and it lets us create different agent archetypes straightforwardly via a config file. The decorator-based reward and observation also make it easy to extend towards more complex behaviours - for example, colluding agents, as sketched below.
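To make the decorator idea concrete, here is a minimal sketch of a recursively composed Reward, building on the State sketch above; all class and method names are assumptions:

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Reward(ABC):
    """Maps a (state, action) pair to a scalar reward."""

    @abstractmethod
    def __call__(self, state: State, action: Action) -> float:
        ...


class ProfitReward(Reward):
    """A concrete base reward; the profit formula is a placeholder."""

    def __call__(self, state: State, action: Action) -> float:
        return state.demand * state.price


class RewardDecorator(Reward):
    """Wraps another Reward so that rewards compose recursively."""

    def __init__(self, inner: Reward):
        self.inner = inner


class CollusionPenalty(RewardDecorator):
    """Example decorator: penalises undercutting a rival's price,
    nudging the agent towards tacitly collusive behaviour."""

    def __init__(self, inner: Reward, rival_price: float, weight: float = 1.0):
        super().__init__(inner)
        self.rival_price = rival_price
        self.weight = weight

    def __call__(self, state: State, action: Action) -> float:
        penalty = self.weight * max(0.0, self.rival_price - state.price)
        return self.inner(state, action) - penalty


# Decorators stack, so richer behaviours are one wrap away:
reward = CollusionPenalty(ProfitReward(), rival_price=10.0)
```

An Observation hierarchy would follow the same shape: a base Observation interface plus decorators that augment or mask what each agent sees.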
The Environment would follow the OpenAI gym interface. env.step would call some version of the query distributor.
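A skeleton of that, with the caveat that gym's interface is single-agent (one scalar reward per step), so the multi-agent plumbing here is an assumption, as is the query distributor's distribute method:

```python
import gym
import numpy as np


class PricingEnv(gym.Env):
    """Skeleton only; names and shapes are illustrative."""

    def __init__(self, agents, query_distributor):
        self.agents = agents
        self.query_distributor = query_distributor
        n = len(agents)
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(n,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(n,), dtype=np.float32)

    def reset(self):
        return np.zeros(len(self.agents), dtype=np.float32)

    def step(self, action):
        # Route queries (demand) across agents, then apply each agent's dynamics.
        demand = self.query_distributor.distribute(action)  # hypothetical API
        obs = np.asarray(demand, dtype=np.float32)
        reward = float(obs.sum())  # placeholder scalar reward
        done = False
        return obs, reward, done, {}
```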
Algorithm

Each Algorithm would map to a single agent in the environment. The Algorithm could be a bandit, an optimal control algorithm, or even just some heuristic policy (e.g., TrafficGenerator). Algorithms would contain a Buffer. For our purposes, we could likely repurpose the buffer I've implemented in cadr.
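The interface might look like the following; the Buffer here is a generic replay-buffer stand-in, since the cadr buffer's actual API may differ:

```python
from abc import ABC, abstractmethod
from collections import deque
import random


class Buffer:
    """Stand-in replay buffer; cadr's real interface may differ."""

    def __init__(self, capacity: int = 10_000):
        self._storage = deque(maxlen=capacity)

    def add(self, transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int):
        return random.sample(list(self._storage), batch_size)


class Algorithm(ABC):
    """One Algorithm per agent: a bandit, optimal control, or a heuristic."""

    def __init__(self, buffer: Buffer):
        self.buffer = buffer

    @abstractmethod
    def act(self, observation):
        """Choose an action for the agent this Algorithm controls."""

    @abstractmethod
    def update(self) -> None:
        """Learn from buffered transitions (a no-op for heuristic policies)."""
```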
Experiment Tracking
Just use sacred.
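The boilerplate is minimal; the experiment name and config values below are illustrative:

```python
from sacred import Experiment

ex = Experiment("pricing_sim")


@ex.config
def config():
    n_agents = 2      # sacred captures these locals as config entries
    n_steps = 1_000


@ex.automain
def main(n_agents, n_steps):
    # Build the environment and algorithms here; sacred records the
    # config and seeds for each run.
    print(f"running {n_agents} agents for {n_steps} steps")
```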