Final Project for Reinforcement Learning 2023/2024 at Sapienza University. Our goal is to reproduce the paper Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning, but adapt it to discrete instead of continuous action spaces. The authors provide an implementation of the method that we can use for inspiration.
There is also the MADDPG repo from OpenAI that we could utilise.
We will use the PressurePlate environment. We can directly install this environment from its github repo:
pip install git+ssh://git@github.com/uoe-agents/pressureplate.git
The environment can be set up like a gym environment. It is important to import the pressureplate module at the top so that its environments are registered with the gym package.
import gym
import pressureplate
env = gym.make('pressureplate-linear-4p-v0')
env.reset()
The action space is discrete with 5 actions: up, down, left, right, and no-op.
You need to pass an action for each agent to the environment, e.g. for 4 agents:
observations, rewards, dones, _ = env.step([0, 0, 0, 0])
The step() function returns the observations, rewards, and done flags for all agents, plus an info dictionary.
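For reference, a minimal interaction loop with random actions might look like this (assuming the 4-agent variant and the 4-tuple step API shown above; the 100-step cap is arbitrary):
import random
import gym
import pressureplate

env = gym.make('pressureplate-linear-4p-v0')
observations = env.reset()
for _ in range(100):
    # one random action (0-4) per agent
    actions = [random.randrange(5) for _ in range(4)]
    observations, rewards, dones, _ = env.step(actions)
    if all(dones):
        break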
Each agent only sees the subset of the environment around itself, represented as layered 2D grids. The field of view is controlled by the sensor_range
attribute of the PressurePlate
class. With the default value of 4, each agent observes a 5x5 grid centred on itself.
The 4 grids that each agent can see are the:
The 4 grids are flattened from 5x5 into 1x25 and then concatenated into 1x100. Finally, the agent's x,y coordinates are concatenated at the end, giving us an observation of size 1x102.
# agent is at coordinates 5,13
[0, 0, 0, 0, 0, 0, 1, 1, 0, 0,...,5, 13]
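Given the layout described above (four flattened 5x5 grids followed by the agent's coordinates), a single agent's observation can be unpacked as follows; the helper name is ours:
import numpy as np

def split_observation(obs):
    # split a 1x102 agent observation into four 5x5 grids and the (x, y) position
    obs = np.asarray(obs, dtype=np.float32)
    grids = obs[:100].reshape(4, 5, 5)   # one 5x5 grid per entity type
    position = obs[100:]                 # the agent's x, y coordinates
    return grids, position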
The reward is individual for each agent and is the negative Manhattan distance to that agent's pressure plate, so the reward increases as the agent approaches its plate.
We adopt the decentralized execution framework, so each agent has its own Q function and policy network.
The Observation Encoder $\theta$ encodes the different entities inside the environment into a higher dimension. The four entities that agents can see are:
Each of these is embedded from $\mathbb{R}^{25}$ into $\mathbb{R}^{512}$ by a two-layer neural network (Entity Encoder), denoted as $g$ in the diagram below. The environment flattens the entity grids and returns the observation as a single vector in $\mathbb{R}^{102}$. We unfold the grid part of this vector into a matrix with one row per entity, apply a dedicated entity encoder to each entity, and stack the results into a matrix in $\mathbb{R}^{4 \times 512}$.
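A minimal PyTorch sketch of this step, assuming the sizes described above (25 to 512 per entity, four entities stacked into a 4x512 matrix); the class and variable names are illustrative and do not necessarily match the repo:
import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    # two-layer MLP g: embeds one flattened 5x5 entity grid (R^25) into R^512
    def __init__(self, in_dim=25, hidden_dim=512, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class ObservationEncoder(nn.Module):
    # theta: unfolds the grid part of an observation into 4 entities and encodes each one
    def __init__(self, n_entities=4, in_dim=25, out_dim=512):
        super().__init__()
        self.n_entities = n_entities
        self.in_dim = in_dim
        # one dedicated entity encoder per entity type
        self.encoders = nn.ModuleList(
            [EntityEncoder(in_dim, out_dim, out_dim) for _ in range(n_entities)]
        )

    def forward(self, obs):
        # obs: (batch, 102); drop the trailing x, y coordinates and unfold into (batch, 4, 25)
        grids = obs[:, : self.n_entities * self.in_dim]
        entities = grids.reshape(-1, self.n_entities, self.in_dim)
        encoded = [enc(entities[:, i]) for i, enc in enumerate(self.encoders)]
        return torch.stack(encoded, dim=1)  # (batch, 4, 512)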
The Observation Action Encoder $f$ receives an observation from the environment and first uses the Observation Encoder to embed it in a higher dimension. The observation is global in the sense that it is what the pressureplate environment returns, and it contains information about all agents in the game. Each agent has its own Observation Action Encoder, which it uses to weigh the observations of the other agents that are close to it and to embed the weighted observations, together with the encoded action, into the embedding dimension.
Attention is an integral part of the Observation Action Encoder as well as of the Q function. It is used to calculate weights $\alpha$ that weigh the observations of the other visible agents along the entity dimension. The attention scores are computed as pairwise inner products between the agent owning the Observation Action Encoder/Q function and each other visible agent, for each entity separately. I.e. for the Observation Action Encoder of agent A, we compute the inner product between the plates of agent A and agent B, the doors of agent A and agent B, etc. Essentially, we are finding which parts of what the other agents see are important to the agent.
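As an illustration, the per-entity attention weights could be computed from the encoded observations roughly like this (assuming PyTorch, that $\alpha$ is obtained with a softmax over the other visible agents, and the shapes produced by the Observation Encoder sketch above):
import torch
import torch.nn.functional as F

def entity_attention(own, others):
    # own:    (4, 512)    encoded observation of the agent that owns the encoder
    # others: (n, 4, 512) encoded observations of the n other visible agents
    # score[j, e] = <own[e], others[j, e]>  (pairwise inner product, per entity)
    scores = torch.einsum('ed,jed->je', own, others)
    # normalise over the other agents to obtain the weights alpha (assumed softmax)
    alpha = F.softmax(scores, dim=0)                      # (n, 4)
    # weigh the other agents' observations along the entity dimension
    weighted = torch.einsum('je,jed->ed', alpha, others)  # (4, 512)
    return alpha, weighted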
For the training of multiple agents, MADDPG has been implemented. It can be summarized as follows: each agent learns a decentralized actor together with a centralized critic that conditions on the observations and actions of all agents; transitions are stored in a shared replay buffer, target networks are updated softly with the parameter tau, and future rewards are discounted with gamma.
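As a hedged sketch only (hypothetical attribute names such as target_actor, target_critic and critic_optimizer, not the exact code in this repo), the centralized critic update for one agent looks roughly like this:
import torch
import torch.nn.functional as F

def critic_update(agent_i, agents, batch, gamma=0.95):
    # batch holds per-agent lists of tensors, each of shape (batch_size, ...)
    obs, actions, rewards, next_obs, dones = batch

    with torch.no_grad():
        # target actors pick the next actions for *all* agents
        # (with discrete actions these would typically be one-hot / Gumbel-softmax vectors)
        next_actions = [a.target_actor(next_obs[j]) for j, a in enumerate(agents)]
        # the centralized target critic sees every agent's next observation and action
        target_q = agents[agent_i].target_critic(
            torch.cat(next_obs, dim=1), torch.cat(next_actions, dim=1))
        y = rewards[agent_i] + gamma * (1 - dones[agent_i]) * target_q

    q = agents[agent_i].critic(torch.cat(obs, dim=1), torch.cat(actions, dim=1))
    loss = F.mse_loss(q, y)

    agents[agent_i].critic_optimizer.zero_grad()
    loss.backward()
    agents[agent_i].critic_optimizer.step()
    return loss.item()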
Install as a package
pip install -e .
The MADDPG module has the following parameters, all with default values.
MADDPG RL Parameters
options:
-h, --help show this help message and exit
--num_agents NUM_AGENTS
Number of agents (default: 4)
--episodes EPISODES Number of episodes (default: 1000)
--steps_per_episode STEPS_PER_EPISODE
Number of steps per episode (default: 100)
--batch_size BATCH_SIZE
Batch size (default: 1024)
--buffer_size BUFFER_SIZE
Replay buffer size (default: 1000000)
--learning_rate LEARNING_RATE
Learning rate (default: 0.01)
--gamma GAMMA Discount factor (default: 0.95)
--tau TAU Soft update parameter (default: 0.01)
--verbose Print training progress (default: False)
--render Render the environment after each step (default: False)
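For example, to train MADDPG with some of these flags overridden (using the maddpg_epc entry point shown at the end of this section):
python -m maddpg_epc --num_agents 4 --episodes 2000 --batch_size 512 --verbose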
The MAPPO module has the following parameters:
MAPPO RL Parameters
options:
-h, --help show this help message and exit
--num_agents NUM_AGENTS
Number of agents (default: 2)
--num_episodes NUM_EPISODES
Number of episodes (default: 1)
--max_steps MAX_STEPS
Number of steps per episode (default: 500)
--kan KAN Bool flag for using KAN networks for Actor and Critic (default: False)
--render Render the environment after each step (default: False)
--print_freq PRINT_FREQ
Print frequency in episodes (default: 1)
--parallel_games PARALLEL_GAMES
The number of parallel games to play for the EPC. (default: 3)
--ppo_clip_val PPO_CLIP_VAL
PPO clip value (default: 0.2)
--policy_lr POLICY_LR
Learning rate for the policy network (default: 2e-05)
--value_lr VALUE_LR Learning rate for the value network (default: 2e-05)
--entropy_coef ENTROPY_COEF
Entropy coefficient (default: 0.001)
--target_kl_div TARGET_KL_DIV
Target KL divergence (default: 0.02)
--max_policy_train_iters MAX_POLICY_TRAIN_ITERS
Maximum iterations for policy training (default: 1)
--value_train_iters VALUE_TRAIN_ITERS
Number of iterations for value training (default: 1)
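For example, to run MAPPO alone with some of these flags overridden (using the mappo_epc.mappo entry point shown below):
python -m mappo_epc.mappo --num_agents 4 --num_episodes 100 --print_freq 10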
To execute the module with default values, do:
python -m maddpg_epc
Alternatively, you can run the training script directly:
python __main__.py
The training algorithm logs to TensorBoard. To start the TensorBoard server:
tensorboard --logdir=runs
To run just the MAPPO:
python -m mappo_epc.mappo
To run the MAPPO with EPC:
python -m mappo_epc