Closed dmuestc closed 6 years ago
The `prioritized_replay` memory is actually a contributed feature; only the naive version is from us. But regardless, why do you want to return the importance-sampling weights, or what exactly do you mean by that? The `update_batch` method here takes care of updating the priority scores, so returning them shouldn't really be necessary, right? The unstable performance might have many causes; have you, for example, tried the DQN config in our benchmark project?
Thanks @AlexKuhnle for the reply. My reasoning is as follows:
Quoting the prioritized experience replay paper (Schaul et al.):

> Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights.
So we update the priority scores in the `update_batch` method, but we should also correct the sampling bias via `loss_per_instance * IS_weights` when computing `tf_loss_per_instance`.
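A rough sketch of the correction being described, following the weight formula w_i = (N · P(i))^(−β) with max-normalization from the prioritized replay paper. The function names and signatures here are hypothetical illustrations, not TensorForce's actual API:

```python
import numpy as np

def is_weights(priorities, indices, alpha=0.6, beta=0.4):
    """Importance-sampling weights w_i = (N * P(i))^(-beta) for the
    sampled indices, normalized by the maximum weight for stability."""
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()  # P(i): the prioritized sampling distribution
    weights = (len(priorities) * probs[indices]) ** (-beta)
    return weights / weights.max()

def corrected_loss(loss_per_instance, weights):
    """Weight each sample's loss before averaging, undoing the bias
    introduced by non-uniform (prioritized) sampling."""
    return float(np.mean(weights * loss_per_instance))
```

Samples with low sampling probability receive the largest weights, so rarely-drawn transitions are not systematically under-represented in the gradient.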
Thanks again, @AlexKuhnle.
Hmm, I would have to check the paper again, since I'm not sure where these additional weights are supposed to come from. You might be right, though, that the way it currently works is not optimal.
We aim to soon provide some benchmark plots and configs for various environments, but well-performing configs from the community are of course very welcome. If you have some, feel free to create a PR, ideally for the benchmark project.
I ran a DQN example with the command `python examples/openai_gym.py CartPole-v0 -a examples/configs/dqn.json -n examples/configs/mlp2_network.json`, but set the memory type to "prioritized_replay". It works fine for the first 200–300 episodes, but once the average reward reaches about 200, it rapidly drops to about 10 and never recovers. So I looked into the prioritized replay buffer code and found that it does not return importance-sampling weights when sampling a mini-batch. For reference, both the CNTK implementation and the baselines implementation return these weights. I'm not sure, could this be an issue?
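For illustration, here is a minimal buffer whose `sample` returns the IS weights alongside the batch and indices, mirroring the interface of the implementations mentioned above. All names are made up for this sketch, and sampling is linear-time (real implementations use a sum-tree for efficiency):

```python
import numpy as np

class PrioritizedReplaySketch:
    """Toy prioritized replay buffer that returns IS weights from sample()."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.data = []
        self.priorities = []

    def add(self, transition, priority=1.0):
        # Evict the oldest transition once the buffer is full.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(float(priority))

    def sample(self, batch_size, beta=0.4):
        scaled = np.asarray(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Return the IS weights instead of discarding them.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        batch = [self.data[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, new_priorities):
        # Called after the update step, with e.g. the new TD errors.
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = float(p)
```

The caller would multiply the per-instance losses by the returned `weights` before averaging, then feed the new TD errors back via `update_priorities`.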