twni2016 / pomdp-baselines

Simple (but often Strong) Baselines for POMDPs in PyTorch, ICML 2022
https://sites.google.com/view/pomdp-baselines
MIT License

Double Episode in minibatch #20

Closed: SandervanL closed this issue 11 months ago

SandervanL commented 11 months ago

There is a bug in how episodes are saved versus how they are retrieved in the SeqReplayBuffer class: episodes are stored according to their actual lengths, but are retrieved assuming the maximum length.

For example, suppose my environment has a maximum episode length of 100 and terminates upon success. My agent is doing well and ends two consecutive episodes after 20 and 30 steps. These episodes are stored in the buffer directly after each other, so the third episode starts at index 50.

However, when sampling episodes, the random_episodes method does not account for this: its first for loop assumes that every episode has the maximum episode length (self._sampled_seq_len). When the first episode is sampled, the returned window spans indices 0-99 and therefore also contains episodes 2 and 3.
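Here is a toy reproduction of what I mean (not the repo's code; the episode-id tagging and buffer size are just for illustration):

```python
import numpy as np

sampled_seq_len = 100              # assumed value of self._sampled_seq_len
buffer = np.zeros(200, dtype=int)  # flat storage, 0 = empty slot

episode_lengths = [20, 30]
top = 0
episode_starts = []
for ep_id, seq_len in enumerate(episode_lengths, start=1):
    episode_starts.append(top)
    buffer[top:top + seq_len] = ep_id  # tag each step with its episode id
    top += seq_len                     # advanced by actual length, not max

# Sample a window of the maximum length starting at episode 1:
window = buffer[episode_starts[0]:episode_starts[0] + sampled_seq_len]
print(set(window.tolist()))  # {0, 1, 2}: the window also covers episode 2
```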

I propose fixing this by changing how self._top and self._size are incremented in the add_episode method: on those lines, change + seq_len to + self._sampled_seq_len.
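A minimal sketch of the proposed change (the buffer size and the _max_size name are placeholders; only _top, _size, _sampled_seq_len, and add_episode come from the actual class):

```python
class TinySeqBuffer:
    """Sketch of the proposed fix, not the real SeqReplayBuffer."""

    def __init__(self, buffer_size: int, sampled_seq_len: int):
        self._max_size = buffer_size  # placeholder attribute name
        self._sampled_seq_len = sampled_seq_len
        self._top = 0
        self._size = 0

    def add_episode(self, seq_len: int) -> None:
        # current behavior: pointers advance by the episode's actual length
        #   self._top = (self._top + seq_len) % self._max_size
        # proposed: always advance by the maximum sampled sequence length,
        # so every stored episode occupies a fixed-size slot
        self._top = (self._top + self._sampled_seq_len) % self._max_size
        self._size = min(self._size + self._sampled_seq_len, self._max_size)

buf = TinySeqBuffer(buffer_size=1000, sampled_seq_len=100)
for seq_len in (20, 30):
    buf.add_episode(seq_len)
print(buf._top)  # 200: each episode consumed a full 100-step slot
```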

twni2016 commented 11 months ago

Hi, I don't think there is a bug here. The sampling method also returns masks that indicate whether each item is valid. Therefore, the agent is trained only on the first episode and ignores the others.
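For illustration, this is roughly how such a validity mask zeroes out steps past the true episode end in a sequence loss (shapes and names here are made up for the example, not taken from the repo):

```python
import torch

sampled_seq_len, batch_size = 100, 2
per_step_loss = torch.ones(sampled_seq_len, batch_size)

valid_lengths = torch.tensor([20, 30])          # true episode lengths
steps = torch.arange(sampled_seq_len).unsqueeze(1)
mask = (steps < valid_lengths).float()          # 1 while inside the episode

# average only over valid steps, as a masked sequence loss typically does
loss = (per_step_loss * mask).sum() / mask.sum()
print(loss.item())  # 1.0: the padded/overlapping steps contributed nothing
```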