google-deepmind / reverb

Reverb is an efficient and easy-to-use data storage and transport system designed for machine learning research

Get the current max priority in a table. #91

Open ethanluoyc opened 2 years ago

ethanluoyc commented 2 years ago

Hi Reverb team,

I am interested in using prioritized experience replay on top of an Acme agent, inserting each new experience with a priority equal to the current maximum priority in the buffer. I have looked around but haven't found a good way to do this. Is there a recommended approach in Reverb?
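For concreteness, here is roughly what I have in mind; the table name, address, and the `MaxPriorityAdder` helper are just placeholders, and I am tracking the max priority on the client side because I have not found a way to query it from the table:

```python
import reverb


class MaxPriorityAdder:
  """Sketch: insert new items at the running maximum priority, tracked client-side."""

  def __init__(self, client: reverb.Client, table: str = 'replay_buffer'):
    self._client = client
    self._table = table
    self._max_priority = 1.0  # Initial value, as in the PER paper.

  def add(self, transition):
    # New transitions are inserted with the largest priority seen so far.
    self._client.insert(transition, priorities={self._table: self._max_priority})

  def update_priorities(self, keys, td_errors, alpha=0.6):
    # After a learner step, refresh the sampled items' priorities from |td_error|^alpha
    # and keep track of the running maximum.
    updates = {key: abs(err) ** alpha for key, err in zip(keys, td_errors)}
    self._client.mutate_priorities(self._table, updates=updates)
    self._max_priority = max(self._max_priority, max(updates.values()))


adder = MaxPriorityAdder(reverb.Client('localhost:8000'))  # placeholder address
```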

Many thanks in advance!

sabelaraga commented 2 years ago

Hi Yicheng,

Could you give us more details on how you are planning to sample? From what you describe, if the latest inserted item is the one with the highest priority, you may want to take a look at the Lifo selector.
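For reference, a table that always returns the most recently inserted item would look roughly like this (the name, size, and rate limiter are just examples):

```python
import reverb

# Example table where sampling always returns the most recently inserted item
# (Lifo selector); removal still happens in FIFO order when the table is full.
server = reverb.Server(tables=[
    reverb.Table(
        name='replay_buffer',
        sampler=reverb.selectors.Lifo(),
        remover=reverb.selectors.Fifo(),
        max_size=100_000,
        rate_limiter=reverb.rate_limiters.MinSize(1)),
])
```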

Sabela.

ethanluoyc commented 2 years ago

Hi Sabela,

Thanks for the reply! I am basically trying to implement the exact PER scheme used in https://arxiv.org/abs/1511.05952.

The dqn_zoo prioritized DQN agent does what I want.

https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/prioritized/agent.py

However, the prioritized DQN agent in Acme uses a default priority of 1.0 when adding a new transition. See e.g. https://github.com/deepmind/acme/blob/076e8e1c1b8e13e8aae9708e94d3e2dca4a7cd03/acme/agents/jax/dqn/builder.py#L137

I believe this is different from the dqn_zoo implementation. I'm not sure whether there is a practical difference in performance, but I would like to implement PER in a way that is as close to the original paper as possible.

sabelaraga commented 2 years ago

It is inserted with priority 1.0, but it then uses the PER implementation of Reverb (the Prioritized sampler); the interesting part of the code is the selector's sampling logic.
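Roughly speaking, with the Prioritized selector an item's probability of being sampled is proportional to `priority ** priority_exponent`, so what matters is the priority relative to the other items rather than its absolute scale. A minimal sketch of that setup (the exponent and table settings are just examples):

```python
import reverb

# Table using Reverb's prioritized sampler. priority_exponent plays the role of
# the alpha exponent from the PER paper.
table = reverb.Table(
    name='replay_buffer',
    sampler=reverb.selectors.Prioritized(priority_exponent=0.6),
    remover=reverb.selectors.Fifo(),
    max_size=100_000,
    rate_limiter=reverb.rate_limiters.MinSize(1))

# After a learner step, priorities are typically refreshed from the new TD errors,
# e.g. with something like:
#   client.mutate_priorities('replay_buffer', updates={key: abs(td_error) ** 0.6})
```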

ethanluoyc commented 2 years ago

I see, so in some sense the priority values are normalized such that the maximum priority would be 1.0, is that correct?

ethanluoyc commented 2 years ago

Actually, not quite. Say we use the td_error to update the priorities; then these values would not be normalized between 0.0 and 1.0. For example, if the td_error is 5.0, updating the priorities would set the new priority to 5.0. Using a priority of 1.0 would then not ensure that the newly inserted item has a higher priority than older samples with large td_errors.
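To make that concrete, here is a small numeric sketch (the priorities and exponent are made-up values):

```python
import numpy as np

# Suppose the table already holds items whose priorities came from TD errors,
# and a new item is then inserted with priority 1.0 (the last entry).
priorities = np.array([5.0, 0.2, 1.0])
alpha = 0.6  # priority_exponent of the Prioritized selector

probs = priorities ** alpha
probs /= probs.sum()
print(probs)  # ~[0.66, 0.10, 0.25]: the new item is sampled much less often than
              # the old item whose td_error was 5.0.
```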

sabelaraga commented 2 years ago

That would not be a problem if the td_error were capped to [-1, 1]. Do you have an example where it is not? I'm trying to verify in the code, but it may make sense to ask in the Acme repo as well (to make sure there are no issues with the DQN implementation).
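In other words, something along these lines when turning TD errors into priorities; this is only a sketch of the capping idea, not what Acme currently does (the function name and exponent are made up):

```python
import numpy as np

def priorities_from_td_errors(td_errors, alpha=0.6):
  # Cap TD errors to [-1, 1] before turning them into priorities, so a newly
  # inserted item with priority 1.0 can never be dominated by older items.
  capped = np.clip(td_errors, -1.0, 1.0)
  return np.abs(capped) ** alpha

# e.g. client.mutate_priorities('replay_buffer',
#                               updates=dict(zip(keys, priorities_from_td_errors(td_errors))))
```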

ethanluoyc commented 2 years ago

I don't think the DQN in Acme clips the td_error. I know some Atari agents clip the maximum absolute reward to be between -1 and 1, but that doesn't mean the td_error is in any way bounded.

https://github.com/deepmind/acme/blob/master/acme/agents/jax/dqn/losses.py#L74

I have cross-posted this to dm-acme.