Closed: ShuangLI59 closed this issue 4 years ago.

Why are the average rewards reported in the paper so much higher than what the code produces? I get ~6 after training, but the paper reports 125. Did you change the reward in the environment? Also, at the end of each episode in the multi_speaker_listener environment, the listener cannot reach its target position. Is this the same as your results?
The code averages rewards over timesteps (25 steps per episode in multi_speaker_listener), while the paper does not. So you need to multiply the rewards logged by the code by the number of timesteps to get the results in the paper (i.e., an average of ~5 per timestep corresponds to 5 × 25 = 125).
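As a minimal sketch of that conversion (the variable and function names here are just illustrative, not taken from the repo):

```python
# Illustrative only: rescale the per-timestep average reward that the training
# code logs into the per-episode total reported in the paper.
EPISODE_LENGTH = 25  # timesteps per episode in multi_speaker_listener

def to_episode_return(mean_reward_per_step, episode_length=EPISODE_LENGTH):
    """Scale a per-timestep average reward to a per-episode total."""
    return mean_reward_per_step * episode_length

print(to_episode_return(5.0))  # 125.0, roughly the value reported in the paper
print(to_episode_return(6.0))  # 150.0, roughly what a logged average of ~6 corresponds to
```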
If your runs are reaching that level of reward (around 6 per timestep), then the listeners should be consistently reaching their targets. How are you checking this?
Thanks for answering. Yes, I visualize the rendered images after training. Do the PyTorch / OpenAI Baselines / OpenAI Gym versions influence the performance?
The fact that your runs are achieving that level of rewards indicates that they are training properly. Without more information I can't be sure what's wrong. Are you loading the parameters of the trained model before visualization? Can you share some examples of what the rendered images look like? Also, it would be useful to see the code you're using to visualize the policies.
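One common failure mode is rendering with freshly initialized networks rather than the saved ones. A rough sanity check, with names following maddpg-pytorch's evaluate.py (the loader, import path, and attributes may differ in your repo, and the model path below is just a placeholder):

```python
from algorithms.maddpg import MADDPG  # import path as in maddpg-pytorch; adjust for your repo

model_path = "models/multi_speaker_listener/run1/model.pt"  # placeholder path

# Restore the trained parameters BEFORE building the rollout/render loop.
maddpg = MADDPG.init_from_save(model_path)
maddpg.prep_rollouts(device='cpu')  # put the policies in rollout mode on CPU

# Quick check that something was actually loaded: freshly initialized networks
# usually have noticeably smaller parameter norms than trained ones.
for i, agent in enumerate(maddpg.agents):
    total_norm = sum(p.norm().item() for p in agent.policy.parameters())
    print(f"agent {i} policy parameter norm: {total_norm:.3f}")
```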
The visualization code is similar to https://github.com/shariqiqbal2810/maddpg-pytorch/blob/master/evaluate.py. The generated results are attached.
You should check whether the rollouts in evaluate.py are producing the same amount of reward that you see at the end of training. It's pretty clear that this is not the case here, which indicates a problem with how you are loading the parameters, or something else along those lines.
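A quick way to do that check, again with hypothetical names modeled on the evaluate.py rollout loop: accumulate the per-step rewards of each evaluation episode and compare the totals (and per-step means) against the numbers logged at the end of training.

```python
import numpy as np

# Hypothetical evaluation loop: `env` and `get_actions` stand in for whatever
# environment wrapper and policy call your visualization script already uses.
def evaluate_returns(env, get_actions, n_episodes=10, episode_length=25):
    returns = []
    for ep in range(n_episodes):
        obs = env.reset()
        total = 0.0
        for t in range(episode_length):
            actions = get_actions(obs)            # actions from the trained policies
            obs, rewards, dones, infos = env.step(actions)
            total += np.mean(rewards)             # average over agents, summed over the episode
        returns.append(total)
        print(f"episode {ep}: return = {total:.2f}")
    mean_return = np.mean(returns)
    print(f"mean return over {n_episodes} episodes: {mean_return:.2f} "
          f"({mean_return / episode_length:.2f} per timestep)")
    return returns
```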
test.zip: this is the code I used to visualize.
Sorry, I don't see anything that stands out as problematic in that code. Since you are getting good results during training, I would recommend matching your evaluation code to the rollout code in the training procedure as closely as possible and figuring out where the difference is.
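Typical places where evaluation diverges from training rollouts include leftover exploration noise (if I remember evaluate.py correctly, it disables this via an explore=False flag when stepping the policies), networks not in eval mode, or observations wrapped differently than during training. A hedged sketch of a deterministic action helper (the agent list and `.policy` attribute are assumptions modeled on maddpg-pytorch; substitute whatever your code actually exposes):

```python
import torch

def deterministic_actions(agents, torch_obs):
    """Run the trained policies deterministically, mirroring the training-time
    rollout but with exploration disabled (assumes each agent exposes a
    callable `policy` module, as in maddpg-pytorch)."""
    actions = []
    with torch.no_grad():                      # no gradients needed at evaluation time
        for agent, obs in zip(agents, torch_obs):
            agent.policy.eval()                # make any dropout/batchnorm deterministic
            actions.append(agent.policy(obs))  # same forward pass used during training rollouts
    return actions
```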
I see, so this is different from your testing results, right? Maybe there are some bugs in my code.
Yes, I was able to visualize successful trials where the listeners reach their targets, so I'm not exactly sure what's going wrong here. Good luck! I will close this issue for now, but feel free to comment if you have any other questions.
Thanks a lot for your help!