Hi everyone,
I am implementing the PPO algorithm on this environment. I successfully ran a few experiments in the simple single-agent environment, which I used for debugging. Now I am trying to scale the code so that it is also compatible with the multi-agent setting.
I understand the theoretical concept of the centralized-learning, decentralized-execution approach, but I am quite confused about the coding/engineering changes needed in the network update step of the PPO algorithm.
I think that each actor network (assuming no shared layers) will be updated with that agent's own actor loss, but how are the critics updated? Should I compute a cumulative critic loss and backpropagate it through every critic network?
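To make the question concrete, here is a rough sketch of the update step I have in mind. The `Actor`/`Critic` classes, the `agents` list, and the `batch` field names are just placeholders from my own code, not from any library, and the critics take a global state since I'm aiming for centralized training:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Decentralized actor: sees only its own agent's observation.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    # Centralized critic: sees the joint/global state.
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def update(agents, batch, clip_eps=0.2):
    # agents: list of dicts {"actor", "critic", "actor_opt", "critic_opt"}, one per agent
    # batch: per-agent obs, actions, old_log_probs, advantages, returns,
    #        plus the shared global_state used by the centralized critics
    for i, ag in enumerate(agents):
        # Per-agent clipped PPO actor loss (this part I think I understand).
        dist = ag["actor"](batch["obs"][i])
        ratio = torch.exp(dist.log_prob(batch["actions"][i]) - batch["old_log_probs"][i])
        adv = batch["advantages"][i]
        actor_loss = -torch.min(
            ratio * adv,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
        ).mean()
        ag["actor_opt"].zero_grad()
        actor_loss.backward()
        ag["actor_opt"].step()

        # Option A: each critic is updated with its own value loss.
        value = ag["critic"](batch["global_state"])
        critic_loss = nn.functional.mse_loss(value, batch["returns"][i])
        ag["critic_opt"].zero_grad()
        critic_loss.backward()
        ag["critic_opt"].step()

    # Option B (what I'm asking about): sum all per-agent critic losses into one
    # "cumulative" loss and backpropagate it once through every critic.
```

Is Option A the right way to do it, or is the cumulative loss of Option B the intended approach, and does it even make a difference when the critics don't share any parameters?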