Motivation
Support multi-customer and multi-product.
In Scope
Out Of Scope
Proposed High-Level Design
Taking a step back for a moment: in RL, agents act in an environment. Roughly speaking, we can split our design problem into two components: simulation and algorithm.

Simulation

Agents (which have an action space) and entities (which do not) interact in an environment. I think it makes sense to begin by describing the state and action spaces.
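A minimal sketch of what State could look like, assuming a pricing-style domain; the fields are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass


@dataclass
class State:
    # Illustrative fields only: these are assumptions about the domain.
    price: float      # price currently set by the agent
    inventory: int    # units of product on hand
    demand: float     # most recently observed demand
```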
We can similarly define Action. Dynamics would map actions to state updates. Agents would then hold instances of State and Action, which are used to create an instance of Dynamics. Agents would also hold an instance of Reward, which we construct recursively using the decorator pattern; the same goes for Observation. This design enables multi-customer as well as multi-product setups, and it lets us create different agent archetypes straightforwardly via a config file. The decorator-based reward and observation also make it easy to extend towards more complex behaviours - for example, colluding agents, as sketched below.
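To make the decorator idea concrete, here is a minimal sketch of a recursively composed Reward, building on the State sketch above; all class and method names are assumptions:

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Reward(ABC):
    """Maps a (state, action) pair to a scalar reward."""

    @abstractmethod
    def __call__(self, state: State, action: Action) -> float:
        ...


class ProfitReward(Reward):
    """A concrete base reward; the profit formula is a placeholder."""

    def __call__(self, state: State, action: Action) -> float:
        return state.demand * state.price


class RewardDecorator(Reward):
    """Wraps another Reward so that rewards compose recursively."""

    def __init__(self, inner: Reward):
        self.inner = inner


class CollusionPenalty(RewardDecorator):
    """Example decorator: penalises undercutting a rival's price,
    nudging the agent towards tacitly collusive behaviour."""

    def __init__(self, inner: Reward, rival_price: float, weight: float = 1.0):
        super().__init__(inner)
        self.rival_price = rival_price
        self.weight = weight

    def __call__(self, state: State, action: Action) -> float:
        penalty = self.weight * max(0.0, self.rival_price - state.price)
        return self.inner(state, action) - penalty


# Decorators stack, so richer behaviours are one wrap away:
reward = CollusionPenalty(ProfitReward(), rival_price=10.0)
```

An Observation hierarchy would follow the same shape: a base Observation interface plus decorators that augment or mask what each agent sees.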
The Environment would follow the OpenAI gym interface. env.step would call some version of the query distributor.
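A skeleton of that, with the caveat that gym's interface is single-agent (one scalar reward per step), so the multi-agent plumbing here is an assumption, as is the query distributor's distribute method:

```python
import gym
import numpy as np


class PricingEnv(gym.Env):
    """Skeleton only; names and shapes are illustrative."""

    def __init__(self, agents, query_distributor):
        self.agents = agents
        self.query_distributor = query_distributor
        n = len(agents)
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(n,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(n,), dtype=np.float32)

    def reset(self):
        return np.zeros(len(self.agents), dtype=np.float32)

    def step(self, action):
        # Route queries (demand) across agents, then apply each agent's dynamics.
        demand = self.query_distributor.distribute(action)  # hypothetical API
        obs = np.asarray(demand, dtype=np.float32)
        reward = float(obs.sum())  # placeholder scalar reward
        done = False
        return obs, reward, done, {}
```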
Algorithm

Each Algorithm would map to a single agent in the environment. The Algorithm could be a bandit, an optimal control algorithm, or even just some heuristic policy (e.g., TrafficGenerator). Algorithms would contain a Buffer. For our purposes, we could likely repurpose the buffer I've implemented in cadr.
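The interface might look like the following; the Buffer here is a generic replay-buffer stand-in, since the cadr buffer's actual API may differ:

```python
from abc import ABC, abstractmethod
from collections import deque
import random


class Buffer:
    """Stand-in replay buffer; cadr's real interface may differ."""

    def __init__(self, capacity: int = 10_000):
        self._storage = deque(maxlen=capacity)

    def add(self, transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int):
        return random.sample(list(self._storage), batch_size)


class Algorithm(ABC):
    """One Algorithm per agent: a bandit, optimal control, or a heuristic."""

    def __init__(self, buffer: Buffer):
        self.buffer = buffer

    @abstractmethod
    def act(self, observation):
        """Choose an action for the agent this Algorithm controls."""

    @abstractmethod
    def update(self) -> None:
        """Learn from buffered transitions (a no-op for heuristic policies)."""
```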
Experiment Tracking
Just use sacred.
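The boilerplate is minimal; the experiment name and config values below are illustrative:

```python
from sacred import Experiment

ex = Experiment("pricing_sim")


@ex.config
def config():
    n_agents = 2      # sacred captures these locals as config entries
    n_steps = 1_000


@ex.automain
def main(n_agents, n_steps):
    # Build the environment and algorithms here; sacred records the
    # config and seeds for each run.
    print(f"running {n_agents} agents for {n_steps} steps")
```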