tensorforce / tensorforce

Tensorforce: a TensorFlow library for applied reinforcement learning
Apache License 2.0

No importance sampling weights in prioritized replay buffer, will it be an issue? #267

Closed dmuestc closed 6 years ago

dmuestc commented 6 years ago

I ran a DQN example with the command "python examples/openai_gym.py CartPole-v0 -a examples/configs/dqn.json -n examples/configs/mlp2_network.json", but set the memory type to "prioritized_replay". It works fine for the first 200~300 episodes, but once the average reward reaches about 200, it rapidly drops to about 10 and never goes up again. So I looked into the prioritized replay buffer code and found that no importance sampling weights are returned when the mini-batch is sampled. For reference, both the CNTK implementation and the baselines implementation return these weights. I'm not sure: will this be an issue?
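To illustrate what I mean by returning the weights, here is a rough sketch (not the actual Tensorforce, CNTK, or baselines code; the class and method names are just illustrative): a proportional prioritized buffer whose sample() hands back the importance-sampling weights together with the batch and the sampled indices.

```python
# Rough sketch only, not Tensorforce code: proportional prioritized replay
# whose sample() also returns the importance-sampling weights and indices.
import numpy as np

class SketchPrioritizedReplay:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.pos = [], 0
        self.priorities = np.zeros(capacity)

    def add(self, transition):
        # New transitions get the current maximum priority so they are sampled at least once.
        self.priorities[self.pos] = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        probs = self.priorities[:len(self.data)] ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.data), batch_size, p=probs)
        # These are the weights I think should be returned alongside the batch:
        # w_i = (N * P(i))^(-beta), normalized by the maximum weight.
        weights = (len(self.data) * probs[indices]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in indices], indices, weights

    def update_batch(self, indices, td_errors, eps=1e-6):
        # Priority update from the new TD errors, analogous to update_batch.
        self.priorities[indices] = np.abs(td_errors) + eps
```

The training code could then multiply each per-instance loss by its weight before averaging, instead of taking a plain mean over the batch.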

AlexKuhnle commented 6 years ago

The prioritized_memory is actually a contributed feature, and only the naive version is from us... But regardless, why do you want the importance sampling weights to be returned, or what exactly do you mean by that? The update_batch method here takes care of updating the priority scores, but returning them shouldn't really be necessary, right? Unstable performance can have many causes; have you e.g. tried the DQN config in our benchmark project?

dmuestc commented 6 years ago

Thanks for the reply @AlexKuhnle, my reasoning is as follows:

  1. From the prioritized experience replay paper, we know:

Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights.

So we should update the priority scores in the update_batch method, and also correct the sampling bias by multiplying loss_per_instance by the IS weights when computing tf_loss_per_instance (see the sketch after this list).

  2. I tried the benchmark project, and it works very well with the default configs. For other classic control tasks like Pendulum-v0, it seems that PPO/TRPO don't converge well, or sometimes don't converge at all. Are they parameter sensitive? I'll do some hyper-parameter tuning in the configs and run more experiments. It would be great, and of real value, to have more effective configs covering various kinds of environments: simple games, Atari, MuJoCo, ...
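As a concrete version of the correction in point 1, here is a minimal, self-contained sketch (illustrative only, not Tensorforce code; the function name and the example numbers are made up):

```python
# Illustrative only: the IS correction from the prioritized replay paper,
# w_i = (N * P(i))^(-beta), normalized by the maximum weight in the batch.
import numpy as np

def is_corrected_loss(loss_per_instance, sample_probs, memory_size, beta=0.4):
    weights = (memory_size * sample_probs) ** (-beta)
    weights /= weights.max()  # normalization keeps the correction from inflating the loss
    return float(np.mean(weights * loss_per_instance))

# Example: three sampled transitions with their sampling probabilities P(i).
loss_per_instance = np.array([0.9, 0.2, 0.5])
sample_probs = np.array([0.10, 0.01, 0.04])
print(is_corrected_loss(loss_per_instance, sample_probs, memory_size=1000))
```

In other words, tf_loss_per_instance would be scaled by these weights before the mean is taken, rather than averaged directly.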

Thanks @AlexKuhnle again.

AlexKuhnle commented 6 years ago

Hmm, I would have to check the paper again, since I don't know off-hand where these additional weights would come from. You might be right, though, that the way it currently works is not optimal.

We aim to provide benchmark plots and configs for various environments soon, but well-performing configs from the community are of course very welcome. If you have some, feel free to create a PR, ideally for the benchmark project.