openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Incorrectly Normalized Importance Weights in Prioritized Experience Replay #1131

Open wegfawefgawefg opened 4 years ago

wegfawefgawefg commented 4 years ago

In the original PER paper, the importance-sampling weights are normalized by the maximum weight. In the code, however, the quantity named `max_weight` is actually the minimum of the normalized weights...

Was this done intentionally? Wouldn't it make the importance weights much larger than they should be? And not just larger: if the priorities were not already normalized, many of the weights would end up above 1.0.

Additionally, I understand that fixing this would require a MaxSegTree to replace the current MinSegTree.
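For what it's worth, here is a hypothetical sketch of such a MaxSegmentTree, written in the style of the existing MinSegmentTree and assuming the `SegmentTree` base class from baselines/common/segment_tree.py:

```python
from baselines.common.segment_tree import SegmentTree

# Hypothetical sketch, not part of baselines: a max-reduction segment
# tree in the style of the existing MinSegmentTree, assuming the same
# SegmentTree(capacity, operation, neutral_element) base class.
class MaxSegmentTree(SegmentTree):
    def __init__(self, capacity):
        super().__init__(
            capacity=capacity,
            operation=max,
            neutral_element=float('-inf'),
        )

    def max(self, start=0, end=None):
        """Return max(arr[start], ..., arr[end])."""
        return self.reduce(start, end)
```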

https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py#L159

```python
class PrioritizedReplayBuffer(ReplayBuffer):
    ...

    def sample(self, batch_size, beta):
        assert beta > 0

        idxes = self._sample_proportional(batch_size)

        weights = []
        p_min = self._it_min.min() / self._it_sum.sum()
        max_weight = (p_min * len(self._storage)) ** (-beta)

        for idx in idxes:
            p_sample = self._it_sum[idx] / self._it_sum.sum()
            weight = (p_sample * len(self._storage)) ** (-beta)
            weights.append(weight / max_weight)
        weights = np.array(weights)
        encoded_sample = self._encode_sample(idxes)
        return tuple(list(encoded_sample) + [weights, idxes])
```
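For comparison, here is a minimal sketch (illustrative only, not the baselines API) of the weight computation as described in the paper, where each weight w_i = (N * P(i))^(-beta) is scaled by 1 / max_j w_j:

```python
import numpy as np

def per_weights(priorities, beta):
    # P(i): sampling probabilities derived from the priorities.
    probs = priorities / priorities.sum()
    n = len(priorities)
    # w_i = (N * P(i)) ** (-beta), as in the paper.
    weights = (n * probs) ** (-beta)
    # The paper scales by 1 / max_j w_j so every weight is <= 1.
    return weights / weights.max()

print(per_weights(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), beta=0.4))
```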

wegfawefgawefg commented 4 years ago

I think the intention is to use the min for inverse normalization, since the weights need to be small for high-priority states and large for low-priority ones.

You could just as easily do regular normalization and then take weights = 1 - weights. This isn't a negligible decision, however: with inverse normalization, the ratios of the weights to each other are not the same as the ratios of the priorities to each other.

Consider the following:

```python
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5])
>>> a
array([1, 2, 3, 4, 5])
>>> a / a.max()
array([0.2, 0.4, 0.6, 0.8, 1. ])
>>> a.min() / a
array([1.        , 0.5       , 0.33333333, 0.25      , 0.2       ])
```

As you can see, inverse normalization does not preserve the priority ratios. I don't know why this decision was made, but it's definitely not clear from the code.
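A quick numerical check of the ratio claim (illustrative only):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

max_norm = a / a.max()   # regular normalization
inv_norm = a.min() / a   # inverse normalization, as in the snippet above

# The second priority is 2x the first.
print(a[1] / a[0])                # 2.0
print(max_norm[1] / max_norm[0])  # 2.0 -> ratio preserved
print(inv_norm[1] / inv_norm[0])  # 0.5 -> ratio inverted, not preserved
```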

beardyFace commented 4 months ago

I have run into the same confusion about this implementation. It has cropped up in several other codebases as well, but no one has given any indication of why the weights are calculated differently from the original PER approach.

Is anyone able to give any insight into this implementation decision?