Closed: ShuangLI59 closed this issue 4 years ago.

Why are the average rewards reported in the paper so much higher than what the code produces? I get ~6 after training, but the paper reports 125. Did you change the reward in the environment? Also, at the end of each episode in the multi_speaker_listener environment, the listener cannot reach its target position. Is this the same as your results?
The code averages rewards over timesteps (25 steps per episode in multi_speaker_listener), while the paper does not. So you need to multiply the rewards logged by the code by the number of timesteps to get the results in the paper (i.e., an average of ~5 per timestep corresponds to 5 × 25 = 125).
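As a minimal sketch of that conversion (the variable and function names here are just illustrative, not taken from the repo):

```python
# Illustrative only: rescale the per-timestep average reward that the training
# code logs into the per-episode total reported in the paper.
EPISODE_LENGTH = 25  # timesteps per episode in multi_speaker_listener

def to_episode_return(mean_reward_per_step, episode_length=EPISODE_LENGTH):
    """Scale a per-timestep average reward to a per-episode total."""
    return mean_reward_per_step * episode_length

print(to_episode_return(5.0))  # 125.0, roughly the value reported in the paper
print(to_episode_return(6.0))  # 150.0, roughly what a logged average of ~6 corresponds to
```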
If your runs are reaching that level of reward (around 6 per timestep), then the listeners should be consistently reaching their targets. How are you checking this?
Thanks for answering. Yes, I visualize the rendered images after training. Do the PyTorch / OpenAI Baselines / OpenAI Gym versions influence the performance?
The fact that your runs are achieving that level of rewards indicates that they are training properly. Without more information I can't be sure what's wrong. Are you loading the parameters of the trained model before visualization? Can you share some examples of what the rendered images look like? Also, it would be useful to see the code you're using to visualize the policies.
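One common failure mode is rendering with freshly initialized networks rather than the saved ones. A rough sanity check, with names following maddpg-pytorch's evaluate.py (the loader, import path, and attributes may differ in your repo, and the model path below is just a placeholder):

```python
from algorithms.maddpg import MADDPG  # import path as in maddpg-pytorch; adjust for your repo

model_path = "models/multi_speaker_listener/run1/model.pt"  # placeholder path

# Restore the trained parameters BEFORE building the rollout/render loop.
maddpg = MADDPG.init_from_save(model_path)
maddpg.prep_rollouts(device='cpu')  # put the policies in rollout mode on CPU

# Quick check that something was actually loaded: freshly initialized networks
# usually have noticeably smaller parameter norms than trained ones.
for i, agent in enumerate(maddpg.agents):
    total_norm = sum(p.norm().item() for p in agent.policy.parameters())
    print(f"agent {i} policy parameter norm: {total_norm:.3f}")
```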
The visualization code is similar to https://github.com/shariqiqbal2810/maddpg-pytorch/blob/master/evaluate.py. The generated results are attached.
You should check whether the rollouts in evaluate.py are producing the same amount of reward that you see at the end of training. It's pretty clear that this is not the case here, which indicates a problem with how you are loading the parameters, or something else along those lines.
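A quick way to do that check, again with hypothetical names modeled on the evaluate.py rollout loop: accumulate the per-step rewards of each evaluation episode and compare the totals (and per-step means) against the numbers logged at the end of training.

```python
import numpy as np

# Hypothetical evaluation loop: `env` and `get_actions` stand in for whatever
# environment wrapper and policy call your visualization script already uses.
def evaluate_returns(env, get_actions, n_episodes=10, episode_length=25):
    returns = []
    for ep in range(n_episodes):
        obs = env.reset()
        total = 0.0
        for t in range(episode_length):
            actions = get_actions(obs)            # actions from the trained policies
            obs, rewards, dones, infos = env.step(actions)
            total += np.mean(rewards)             # average over agents, summed over the episode
        returns.append(total)
        print(f"episode {ep}: return = {total:.2f}")
    mean_return = np.mean(returns)
    print(f"mean return over {n_episodes} episodes: {mean_return:.2f} "
          f"({mean_return / episode_length:.2f} per timestep)")
    return returns
```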
test.zip: this is the code I used to visualize.
Sorry, I don't see anything that stands out as problematic in that code. Since you are getting good results during training, I would recommend matching your evaluation code to the rollout code in the training procedure as closely as possible and figuring out where the difference is.
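Typical places where evaluation diverges from training rollouts include leftover exploration noise (if I remember evaluate.py correctly, it disables this via an explore=False flag when stepping the policies), networks not in eval mode, or observations wrapped differently than during training. A hedged sketch of a deterministic action helper (the agent list and `.policy` attribute are assumptions modeled on maddpg-pytorch; substitute whatever your code actually exposes):

```python
import torch

def deterministic_actions(agents, torch_obs):
    """Run the trained policies deterministically, mirroring the training-time
    rollout but with exploration disabled (assumes each agent exposes a
    callable `policy` module, as in maddpg-pytorch)."""
    actions = []
    with torch.no_grad():                      # no gradients needed at evaluation time
        for agent, obs in zip(agents, torch_obs):
            agent.policy.eval()                # make any dropout/batchnorm deterministic
            actions.append(agent.policy(obs))  # same forward pass used during training rollouts
    return actions
```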
I see, so this is different from your testing results, right? Maybe there are some bugs in my code.
Yes, I was able to visualize successful trials where the listeners reach their targets, so I'm not exactly sure what's going wrong here. Good luck! I will close this issue for now, but feel free to comment if you have any other questions.
Thanks a lot for your help!