proroklab / rllib_differentiable_comms

This is a minimal example to demonstrate how multi-agent reinforcement learning with differentiable communication channels and centralized critics can be realized in RLLib. This example serves as a reference implementation and starting point for making RLLib more compatible with such architectures.

Need clarification on the concept. Looks like share observation is performing better? #2

Open heng2j opened 2 years ago

heng2j commented 2 years ago

Hi @janblumenkamp and @matteobettini ,

It looks like runs with share_observations enabled and multiple agents/actors always perform better, in both continuous and discrete action spaces. With shared observations, are we still achieving learning with a centralized critic?

BTW, very good work as always!

Thank you, Heng

janblumenkamp commented 2 years ago

Hi Heng! The scenario we include here is just a toy example to demonstrate that we are able to train a model with a differentiable communication channel. The environment is constructed in such a way that the task can only be solved with a shared model (otherwise the agents don't know about each other's goals). The environment can be replaced with a more realistic one, and the model, for example, with a GNN. Hope that answers your question!
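To make the idea concrete, here is a minimal sketch of what a shared model with a differentiable communication channel can look like in PyTorch. This is not the actual model.py code; the class name, layer sizes, and mean-pooling exchange are all made up for illustration. Each agent encodes its own observation, the encodings are pooled across agents, and gradients flow through that pooling back into every agent's encoder.

```python
import torch
import torch.nn as nn

class SharedCommModel(nn.Module):
    """One model shared by all agents with a differentiable comm channel (sketch)."""

    def __init__(self, obs_size, msg_size, n_actions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_size, msg_size), nn.ReLU())
        # Each agent's head sees its own encoding plus the pooled messages.
        self.policy_head = nn.Linear(2 * msg_size, n_actions)

    def forward(self, obs):
        # obs: [batch, n_agents, obs_size]
        msgs = self.encoder(obs)                          # [batch, n_agents, msg_size]
        # Mean-pool the messages over agents; gradients flow through this
        # aggregation into every agent's encoder, which is what makes the
        # communication channel differentiable end-to-end.
        pooled = msgs.mean(dim=1, keepdim=True).expand_as(msgs)
        logits = self.policy_head(torch.cat([msgs, pooled], dim=-1))
        return logits                                     # [batch, n_agents, n_actions]

# Example: batch of 32, 3 agents, observation size 4, 5 discrete actions
model = SharedCommModel(obs_size=4, msg_size=8, n_actions=5)
logits = model(torch.randn(32, 3, 4))
```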

heng2j commented 2 years ago

Hi @janblumenkamp ,

Thank you so much for getting back to me! I was a little bit confused about the difference between "comm" and "no comm". With limited context, I thought "comm" (i.e. share_observations=True) meant non-differentiable communication, and "no comm" (i.e. share_observations=False) meant differentiable communication.

After revisiting the code and the README, I think I may have had the wrong impression. With "no comm" we squeeze and blend the observations into a vector of size encoder_out_features, and with "comm" we extend that vector to a size of number of features times number of agents, as shown in line 53 of model.py. But with either "comm" or "no comm", we are still training one single policy that is used by multiple agents, right?
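Just to check that I'm reading the shapes right, this is roughly the difference I mean (my own illustration with made-up sizes, not the code from model.py):

```python
import torch

batch, n_agents, encoder_out_features = 32, 3, 8
encoded = torch.randn(batch, n_agents, encoder_out_features)

# "no comm": each agent is decoded only from its own encoded features
no_comm_input = encoded                                   # [32, 3, 8]

# "comm": each agent is decoded from the concatenation of all agents'
# encoded features, i.e. number of features times number of agents
comm_input = encoded.reshape(batch, 1, -1).expand(
    batch, n_agents, n_agents * encoder_out_features)     # [32, 3, 24]
```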

I would love to understand more about your approach, since my team is also trying to solve a similar credit assignment issue where we use a single policy to control multiple actors (not agents); think of the relationship between a chess master and the chess pieces. We receive unique observations from each actor, but it is challenging to precisely determine the reward for each actor and feed those rewards back to the learning network, because of the RLlib limitation of having a single unified reward at each time step that you mention in the repo. After spending some time on literature reviews and searching around, your work is the most relevant RLlib-based implementation that aligns with our problem, which is why I would like to understand more about your approach.
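For context, our setup looks roughly like this hypothetical sketch (the environment and names are made up, not from any real codebase): per-actor observations get stacked into one joint observation for the single policy, and the per-actor rewards have to be collapsed into one scalar per step, which is exactly where the credit assignment problem shows up.

```python
import numpy as np

class MasterActorsEnv:
    """Hypothetical single-policy, multi-actor environment (sketch only)."""

    def __init__(self, n_actors, obs_size):
        self.n_actors = n_actors
        self.obs_size = obs_size

    def step(self, joint_action):
        # Unique observation per actor, stacked for the single policy.
        obs = np.stack([np.random.randn(self.obs_size) for _ in range(self.n_actors)])
        # We cannot cleanly attribute credit per actor, so their contributions
        # end up summed into one scalar reward for the whole time step.
        per_actor_rewards = np.zeros(self.n_actors)
        reward = float(per_actor_rewards.sum())
        return obs, reward, False, {}
```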

Cheers, Heng