google / dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
https://github.com/google/dopamine
Apache License 2.0

Is standalone Prioritized Experience Replay implemented? #182

Closed bryanyuan1 closed 2 years ago

bryanyuan1 commented 2 years ago

Hi y'all! Thank you for providing Dopamine, it is such an awesome resource. I am looking for the PER algorithm on its own, but I cannot find it in this repo; what I do find is the Rainbow agent, which includes several improvements beyond PER.

Have you implemented a standalone version of PER in DQN? I did implement PER by piecing together parts of the Rainbow agent, but I want to make sure it is correct and reproduces the performance reported in the PER paper.

bryanyuan1 commented 2 years ago
# Rainbow and prioritized replay are parametrized by an exponent alpha,
# but in both cases it is set to 0.5 - for simplicity's sake we leave it
# as is here, using the more direct tf.sqrt(). Taking the square root
# "makes sense", as we are dealing with a squared loss.
# Add a small nonzero value to the loss to avoid 0 priority items. While
# technically this may be okay, setting all items to 0 priority will cause
# troubles, and also result in 1.0 / 0.0 = NaN correction terms.
update_priorities_op = self._replay.tf_set_priority(
    self._replay.indices, tf.sqrt(loss + 1e-10))

In https://github.com/google/dopamine/blob/master/dopamine/agents/rainbow/rainbow_agent.py, could you explain why the priorities are updated to tf.sqrt(loss + 1e-10)? Is it because the loss here is the squared TD error?

The comment says "as we are dealing with a squared loss", but the loss here is tf.nn.softmax_cross_entropy_with_logits. Is that a squared loss?

psc-g commented 2 years ago

hi, thanks for the note!

regarding your first question, you should be able to just change _build_replay_buffer so it creates an instance of prioritized_replay_buffer, and modify the _store_transition function to also store priorities (as is done in Rainbow).
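as a rough, untested sketch of what that could look like for the TF DQNAgent (mirroring the overrides in rainbow_agent.py; the constructor arguments and the max_recorded_priority lookup are taken from the current code and may differ in your version):

from dopamine.agents.dqn import dqn_agent
from dopamine.replay_memory import prioritized_replay_buffer


class PrioritizedDQNAgent(dqn_agent.DQNAgent):
  """DQN with a prioritized replay buffer (sketch only, not tested)."""

  def _build_replay_buffer(self, use_staging):
    # Swap the uniform circular buffer for the prioritized one; these
    # arguments mirror RainbowAgent._build_replay_buffer.
    return prioritized_replay_buffer.WrappedPrioritizedReplayBuffer(
        observation_shape=self.observation_shape,
        stack_size=self.stack_size,
        use_staging=use_staging,
        update_horizon=self.update_horizon,
        gamma=self.gamma,
        observation_dtype=self.observation_dtype.as_numpy_dtype)

  def _store_transition(self, last_observation, action, reward, is_terminal,
                        priority=None):
    # Give new transitions the maximum priority seen so far, so each one is
    # sampled at least once before its priority is updated.
    if priority is None:
      priority = self._replay.memory.sum_tree.max_recorded_priority
    if not self.eval_mode:
      self._replay.add(last_observation, action, reward, is_terminal, priority)

note this only covers storage and sampling; to match the paper you'd also want to update priorities after each training step (the tf_set_priority call you quoted above) and apply the importance-sampling correction to the loss, both of which you can lift from _build_train_op in rainbow_agent.py.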

Perhaps a good reference is the code for our Revisiting Rainbow paper, which adds PER on top of the DQN agent.

regarding your second question, i believe the cross_entropy loss can be considered an alternative to a squared loss, which is why the square root "makes sense" (hence the quotation marks in the comment).
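to make the mechanics concrete: the sum tree samples transitions proportionally to whatever value is stored, with no exponent applied at sampling time, so the alpha = 0.5 exponent is folded into the stored priority via tf.sqrt. a tiny numpy illustration of that equivalence (purely illustrative, not dopamine code):

import numpy as np

losses = np.array([0.01, 0.25, 4.0])  # per-transition losses from one batch
alpha, eps = 0.5, 1e-10

# Explicit form: priority = (loss + eps) ** alpha, with alpha = 0.5.
explicit_priorities = (losses + eps) ** alpha

# What the Rainbow agent stores: tf.sqrt(loss + 1e-10), i.e. the same quantity.
stored_priorities = np.sqrt(losses + eps)

# The sum tree samples proportionally to the stored values, so folding the
# exponent into the stored priority is equivalent to applying it when sampling.
sampling_probs = stored_priorities / stored_priorities.sum()

and for a loss that really is quadratic in the TD error (loss = delta ** 2), sqrt(loss) is just |delta|, the TD-error magnitude from the PER paper; with the cross-entropy loss it's the analogous heuristic rather than an exact identity.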

hope this helps!

bryanyuan1 commented 2 years ago

Thank you for the answer!