google-deepmind / acme

A library of reinforcement learning components and agents
Apache License 2.0

Is the priority computation correct in prioritized DQN #205

Open ethanluoyc opened 2 years ago

ethanluoyc commented 2 years ago

Hi,

Cross-posting a related issue here: https://github.com/deepmind/reverb/issues/91

After looking at the DQN agent implementation in Acme, I am not sure if the prioritized variant matches the original PER paper.

Specifically, in the original PER, the priority for newly inserted transitions is the maximum priority seen so far. In Acme, however, the priority is uniformly set to 1.0. This deviates from the original PER: newly inserted transitions will not be sampled promptly if 1.0 is much smaller than the typical TD error during training.

I looked into whether it's possible to get the maximum priority from Reverb but haven't found a nice way to do so.
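
For concreteness, here is a minimal sketch of the insertion rule from the PER paper; `MaxPriorityTracker` is a hypothetical helper for illustration, not an Acme or Reverb API:

```python
import numpy as np


class MaxPriorityTracker:
    """Tracks the largest priority seen so far (illustrative only)."""

    def __init__(self, initial_priority: float = 1.0):
        self._max_priority = initial_priority

    def priority_for_new_item(self) -> float:
        # In the original PER, newly inserted transitions get the maximum
        # priority observed so far, so they are sampled soon after insertion.
        return self._max_priority

    def update(self, abs_td_errors: np.ndarray) -> None:
        # Called after a learner step with the freshly computed |TD errors|.
        self._max_priority = max(self._max_priority, float(np.max(abs_td_errors)))


tracker = MaxPriorityTracker()
tracker.update(np.array([0.3, 2.5, 0.9]))  # learner step
print(tracker.priority_for_new_item())     # 2.5, rather than a constant 1.0
```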

nikolamomchev commented 2 years ago

Hi @ethanluoyc ,

Indeed, it seems that the TD error in Acme is NOT clipped. I am not sure whether this causes any performance issues, though; do you have an indication that it does?

I believe @nino-vieillard recently managed to successfully reproduce DQN results; maybe he can shed some light on whether this matters.

FWIW, it seems that it would be easy to clip the td_error for the priority update. Would this help?
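
Something along these lines, perhaps (just a sketch; `clipped_priorities` is not an existing Acme helper):

```python
import numpy as np


def clipped_priorities(td_errors: np.ndarray, max_priority: float = 1.0) -> np.ndarray:
    # Clip |TD error| to [0, max_priority] before writing it back as the
    # item's priority, so the constant initial priority of 1.0 is never
    # smaller than any priority produced by the learner.
    return np.clip(np.abs(td_errors), 0.0, max_priority)
```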

ethanluoyc commented 2 years ago

I have not had the chance to investigate this.

Actually, I came across this problem in a different context, where I want to implement prioritized sampling in a continuous-control agent. I would like newly inserted experience to be sampled immediately after insertion; this can be done if the adder knows the current maximum priority, which seems tricky to do right now in Acme.
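
Roughly, the idea I have in mind (a sketch only; `SharedMaxPriority` and the hooks below are placeholders, not actual Acme or Reverb interfaces):

```python
class SharedMaxPriority:
    """Mutable max priority shared between learner and adder (illustrative)."""

    def __init__(self, initial: float = 1.0):
        self.value = initial

    def update(self, abs_td_errors) -> None:
        # The learner would call this after each step with the batch's |TD errors|.
        self.value = max(self.value, float(max(abs_td_errors)))


shared_max = SharedMaxPriority()


def insert_priority_fn(_trajectory) -> float:
    # Used by the adder when writing new items: always the current maximum,
    # so fresh experience is sampled promptly after insertion.
    return shared_max.value
```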

nikolamomchev commented 2 years ago

I'm approaching this naively and without much knowledge of what you are trying to do. But could you squash the priority into some fixed range (e.g. (0, 1]) and insert fresh experience with a priority of 1?
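
For example, something like this (a sketch; `squashed_priority` is not an existing helper):

```python
import numpy as np


def squashed_priority(td_errors: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Map |TD error| monotonically into (0, 1]; ordering between priorities
    # is preserved, but their relative magnitudes change.
    return np.maximum(1.0 - np.exp(-np.abs(td_errors)), eps)


# Fresh experience is then inserted at the top of the squashed range.
FRESH_PRIORITY = 1.0
```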

ethanluoyc commented 2 years ago

Hmm, I guess squashing would cause issues by changing the relative importance. My solution for now is to just not use Reverb for this case :D In case you are interested, I would like to replicate this paper with Acme: https://arxiv.org/abs/2107.00591. They use a prioritized-replay setup that is currently difficult to reproduce.