twni2016 / pomdp-baselines

Simple (but often Strong) Baselines for POMDPs in PyTorch, ICML 2022
https://sites.google.com/view/pomdp-baselines
MIT License

Double Episode in minibatch #20

Closed: SandervanL closed this issue 11 months ago

SandervanL commented 11 months ago

There is a bug in how episodes are saved versus how they are retrieved in the SeqReplayBuffer class: episodes are stored according to their actual lengths, but are retrieved assuming the maximum length.

For example, suppose my environment has a maximum episode length of 100 and terminates upon success. My agent is doing well and ends two consecutive episodes after 20 and 30 steps. These episodes are stored in the buffer directly after each other, so the third episode starts at index 50.

However, when sampling episodes, the random_episodes method does not account for this: its first for loop assumes that every episode has the maximum episode length (self._sampled_seq_len). When the first episode is sampled, the returned window spans indices 0-99 and therefore also contains episodes 2 and 3.
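Here is a toy reproduction of what I mean (not the repo's code; the episode-id tagging and buffer size are just for illustration):

```python
import numpy as np

sampled_seq_len = 100              # assumed value of self._sampled_seq_len
buffer = np.zeros(200, dtype=int)  # flat storage, 0 = empty slot

episode_lengths = [20, 30]
top = 0
episode_starts = []
for ep_id, seq_len in enumerate(episode_lengths, start=1):
    episode_starts.append(top)
    buffer[top:top + seq_len] = ep_id  # tag each step with its episode id
    top += seq_len                     # advanced by actual length, not max

# Sample a window of the maximum length starting at episode 1:
window = buffer[episode_starts[0]:episode_starts[0] + sampled_seq_len]
print(set(window.tolist()))  # {0, 1, 2}: the window also covers episode 2
```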

I propose fixing this by changing how self._top and self._size are incremented in the add_episode method: on those lines, change + seq_len to + self._sampled_seq_len.
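A minimal sketch of the proposed change (the buffer size and the _max_size name are placeholders; only _top, _size, _sampled_seq_len, and add_episode come from the actual class):

```python
class TinySeqBuffer:
    """Sketch of the proposed fix, not the real SeqReplayBuffer."""

    def __init__(self, buffer_size: int, sampled_seq_len: int):
        self._max_size = buffer_size  # placeholder attribute name
        self._sampled_seq_len = sampled_seq_len
        self._top = 0
        self._size = 0

    def add_episode(self, seq_len: int) -> None:
        # current behavior: pointers advance by the episode's actual length
        #   self._top = (self._top + seq_len) % self._max_size
        # proposed: always advance by the maximum sampled sequence length,
        # so every stored episode occupies a fixed-size slot
        self._top = (self._top + self._sampled_seq_len) % self._max_size
        self._size = min(self._size + self._sampled_seq_len, self._max_size)

buf = TinySeqBuffer(buffer_size=1000, sampled_seq_len=100)
for seq_len in (20, 30):
    buf.add_episode(seq_len)
print(buf._top)  # 200: each episode consumed a full 100-step slot
```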

twni2016 commented 11 months ago

Hi, I don't think there is a bug here. The sampling method also returns masks that indicate whether each item is valid. Therefore, the agent is trained only on the first episode and ignores the others.
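For illustration, this is roughly how such a validity mask zeroes out steps past the true episode end in a sequence loss (shapes and names here are made up for the example, not taken from the repo):

```python
import torch

sampled_seq_len, batch_size = 100, 2
per_step_loss = torch.ones(sampled_seq_len, batch_size)

valid_lengths = torch.tensor([20, 30])          # true episode lengths
steps = torch.arange(sampled_seq_len).unsqueeze(1)
mask = (steps < valid_lengths).float()          # 1 while inside the episode

# average only over valid steps, as a masked sequence loss typically does
loss = (per_step_loss * mask).sum() / mask.sum()
print(loss.item())  # 1.0: the padded/overlapping steps contributed nothing
```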