Hi everyone,
I am implementing the PPO algorithm on this environment. I successfully ran a few experiments in the simple single-agent environment, which I used for debugging. Now I am trying to scale the code so that it is also compatible with the multi-agent setting.
I understand the theoretical concept of the centralized-learning, decentralized-execution approach, but I am quite confused about the coding/engineering changes needed in the network update step of the PPO algorithm.
I think that each actor network (assuming no shared layers) will be updated with that agent's own actor loss, but how are the critics updated? Should I compute a cumulative critic loss and backpropagate it through every critic network?
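To make the question concrete, here is a rough sketch of the update step I have in mind. The `Actor`/`Critic` classes, the `agents` list, and the `batch` field names are just placeholders from my own code, not from any library, and the critics take a global state since I'm aiming for centralized training:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Decentralized actor: sees only its own agent's observation.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    # Centralized critic: sees the joint/global state.
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def update(agents, batch, clip_eps=0.2):
    # agents: list of dicts {"actor", "critic", "actor_opt", "critic_opt"}, one per agent
    # batch: per-agent obs, actions, old_log_probs, advantages, returns,
    #        plus the shared global_state used by the centralized critics
    for i, ag in enumerate(agents):
        # Per-agent clipped PPO actor loss (this part I think I understand).
        dist = ag["actor"](batch["obs"][i])
        ratio = torch.exp(dist.log_prob(batch["actions"][i]) - batch["old_log_probs"][i])
        adv = batch["advantages"][i]
        actor_loss = -torch.min(
            ratio * adv,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
        ).mean()
        ag["actor_opt"].zero_grad()
        actor_loss.backward()
        ag["actor_opt"].step()

        # Option A: each critic is updated with its own value loss.
        value = ag["critic"](batch["global_state"])
        critic_loss = nn.functional.mse_loss(value, batch["returns"][i])
        ag["critic_opt"].zero_grad()
        critic_loss.backward()
        ag["critic_opt"].step()

    # Option B (what I'm asking about): sum all per-agent critic losses into one
    # "cumulative" loss and backpropagate it once through every critic.
```

Is Option A the right way to do it, or is the cumulative loss of Option B the intended approach, and does it even make a difference when the critics don't share any parameters?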