It looks like you're updating the priorities in the replay buffer according to the weighted, squared TD error:
loss = (q_value - expected_q_value.detach()).pow(2) * weights
prios = loss + 1e-5
replay_buffer.update_priorities(indices, prios.data.cpu().numpy())
However, the algorithm in the original paper updates the priority using only the absolute value of the TD error, with no weighting applied. I believe this is a mistake in your implementation.
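For reference, here is a minimal sketch of what I think the update should look like. It assumes `q_value`, `expected_q_value`, and `weights` are 1-D tensors of the same length, as in the snippet above. The helper name `per_loss_and_priorities` and the final `.mean()` reduction are my own assumptions, not taken from your code.

```python
import torch


def per_loss_and_priorities(q_value, expected_q_value, weights, eps=1e-5):
    """Hypothetical helper: weighted loss for training, unweighted |TD error| for priorities."""
    # Raw TD error; the target is detached so no gradient flows through it.
    td_error = q_value - expected_q_value.detach()
    # The weights scale the loss only (mean reduction is an assumption).
    loss = (td_error.pow(2) * weights).mean()
    # Priorities use the absolute TD error alone, plus a small constant
    # so that no transition ends up with zero priority.
    prios = td_error.abs().detach() + eps
    return loss, prios


# Toy batch to illustrate the difference.
q_value = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
expected_q_value = torch.tensor([1.5, 2.0, 1.0])
weights = torch.tensor([0.5, 1.0, 0.25])
loss, prios = per_loss_and_priorities(q_value, expected_q_value, weights)
```

The priorities would then be passed on as before, e.g. `replay_buffer.update_priorities(indices, prios.cpu().numpy())`. Note that in the toy batch the third transition has the largest TD error but the smallest weight, so weighting the priority would understate how surprising it is.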