Closed korbinian-hoermann closed 1 year ago
This is an interesting example. But have you made sure that the OPPONENT_OBS in your postprocessing fn is built for the correct number of opponent agents? I'm only seeing you do this (loss not initialized yet):

```python
sample_batch[OPPONENT_OBS] = np.zeros_like(
    sample_batch[SampleBatch.CUR_OBS])
```

which is only zeros for a single opponent. That's why you get the shape error: you are passing in data for 1 opponent instead of 3.
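The mismatch can be seen with plain NumPy. All dimensions below are made-up stand-ins, not values from the example:

```python
import numpy as np

n_opponents = 3
T, obs_dim = 50, 4                 # illustrative batch length / obs size
cur_obs = np.zeros((T, obs_dim))   # stand-in for sample_batch[SampleBatch.CUR_OBS]

# Wrong: same shape as one agent's obs -> (50, 4)
one_agent = np.zeros_like(cur_obs)

# Needed: room for all opponents along the feature axis -> (50, 12)
all_opponents = np.zeros((cur_obs.shape[0], n_opponents * obs_dim))

print(one_agent.shape, all_opponents.shape)  # (50, 4) (50, 12)
```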
Hello sven1977, I also want to use PPO with a centralized critic, and I'm trying to customize the example. But I don't know which classes must be modified. Do the agents, action space, and state space need to be modified?
centralized_critic_postprocessing, loss_with_central_critic, setup_tf_mixins, central_vf_stats, LearningRateSchedule, EntropyCoeffSchedule, KLCoeffMixin, CentralizedValueMixin
I would like to know whether these functions and classes should be modified. Thank you!
Thank you @sven1977 ! I had to do some more adjustments in the postprocessing fn but it seems to work now. 👍
@zzchuman those are the functions I had to modify:
- the init function of CentralizedCriticModel(TFModelV2): adjust the inputs of the net to suit your agent's and the opponent agents' actions/observations
- centralized_critic_postprocessing, as sven mentioned above: build OPPONENT_OBS for the correct number of opponent agents.
- get_policy_class: using the tf framework, the original one raised an error, so I changed it to this:

```python
def get_policy_class(config):
    if config["framework"] == "torch":
        return CCPPOTorchPolicy
    else:
        return CCPPOTFPolicy
```
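For the first point, the required input width of the central value net is simple arithmetic. The following sketch uses made-up dimensions and variable names, none of which come from the example code:

```python
# Hypothetical sizing for a central VF input that concatenates the agent's
# own obs with every opponent's obs and (one-hot) action:
n_opponents = 3
own_obs_dim = 4     # assumed
opp_obs_dim = 4     # assumed
opp_act_dim = 5     # assumed one-hot width for Discrete(5)

central_vf_input_dim = own_obs_dim + n_opponents * (opp_obs_dim + opp_act_dim)
print(central_vf_input_dim)  # 4 + 3 * 9 = 31
```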
Hello korbin, congratulations! I have some questions:
There is a problem with the SampleBatch.concat_samples method: it generates the output sample batch in the shape ((num_agents - 1) * length_sample_batch, obs_dim), whereas what we want is a sample batch of shape (length_sample_batch, (num_agents - 1) * obs_dim). The initialization of opponent_obs/opponent_action should follow this dimension as well. I have written a new concat_batches function in the SampleBatch class to enable the use of a central critic for more than two agents; if the community wants, I can make a pull request for this change.
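The reshape described above can be sketched in plain NumPy (all dimensions are illustrative, not taken from the example):

```python
import numpy as np

T, n_opp, obs_dim = 50, 3, 4

# concat_samples stacks opponent batches along the time axis -> (150, 4)
stacked = np.arange(n_opp * T * obs_dim, dtype=float).reshape(n_opp * T, obs_dim)

# The centralized critic wants one row per timestep with all opponents'
# observations side by side along the feature axis -> (50, 12):
wanted = (stacked.reshape(n_opp, T, obs_dim)
                 .transpose(1, 0, 2)
                 .reshape(T, n_opp * obs_dim))

# Row t now holds opponent 0's obs at t, then opponent 1's, then opponent 2's.
print(wanted.shape)  # (50, 12)
```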
Currently attempting the exact same thing. Is it possible to restate the complete, correct implementation of your centralized critic with multiple agents? Maybe adapting the example to support multiple agents out of the box would ease this for many users.
Hello, has anyone implemented it?
What is the problem?
ray 1.0.1, Python 3.7, TensorFlow 2.3.1, Windows 10
Hi!
I am trying to solve the following environment with the MAPPO (PPO with a centralized critic)
Reward
For each time step an agent is not in its final position, it receives a reward of -1. For each time step an agent is in its final position, it receives a reward of 0.
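As a minimal sketch, the reward rule above can be written as follows (the function name and position representation are hypothetical, not from the environment code):

```python
def agent_reward(pos, goal):
    """-1 for each step the agent is not in its final position, 0 once it is."""
    return 0 if pos == goal else -1

print(agent_reward((1, 2), (1, 2)))  # 0
print(agent_reward((0, 0), (1, 2)))  # -1
```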
Actions
Observation For each agent an obs consists of:
Resulting in the following observation and action spaces of the environment:
```python
action_space = spaces.Discrete(5)
observation_space = spaces.Box(np.array([0., 0., 0., 0.]), np.array([1., 1., 1., 1.]))
```
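The actual observation contents are listed above; purely as a hypothetical sketch consistent with the Box bounds (the grid size and normalization scheme are assumptions), an observation could be built like this:

```python
import numpy as np

GRID_SIZE = 8  # assumed, not from the environment

def make_obs(agent_xy, goal_xy):
    # Hypothetical: own (x, y) plus goal (x, y), normalized into [0, 1]
    # to fit Box(low=[0,0,0,0], high=[1,1,1,1]).
    return np.array([*agent_xy, *goal_xy], dtype=np.float32) / (GRID_SIZE - 1)

obs = make_obs((2, 3), (7, 7))
print(obs.shape)  # (4,)
```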
One episode lasts for 50 time steps. The goal for all agents is to get into their final position (the cell with the same colour as the corresponding agent) as fast as possible and stay there until the episode ends.
I was able to solve this environment with 2 agents, following RLlib's centralized critic example.
In order to handle an increased number of agents, I made the following changes to the example code (see section "centralized critic model"; I am only using the TF version):
This results in the following error message:
Did anyone manage to adjust the number of agents in the centralized_critic.py example or has an idea what else I have to change?
Thank you in advance!
Cheers, Korbi :)
Reproduction (REQUIRED)