hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Question about how stacking is done in VecFrameStack #1084

Closed: eliork closed this 3 years ago

eliork commented 3 years ago

https://github.com/hill-a/stable-baselines/blob/259f27868f0d727d990f50e04da6e3a5d5367582/stable_baselines/common/vec_env/vec_frame_stack.py#L27-L43

Hi, I am trying to read this code and I am having difficulty understanding how the stacking is done. For example, if I have a single observation of shape (1, 128) and I want to stack 4 observations, does `observations` hold the 4 observations together? Where are the 4 observations concatenated? And what does this line mean?

```python
self.stackedobs[..., -observations.shape[-1]:] = observations
```

Thank you

Miffyli commented 3 years ago

That wrapper follows the frame-stacking idea of the original DQN, where the last four images are stacked on their channel dimension, which in code translates to stacking the observations on the last axis. This would mean your (1, 128) observations become (1, 128 * 4), but do note that this wrapper was not designed for non-image observations per se.
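
To make that concrete, here is a minimal NumPy sketch of the buffer mechanics (illustrative names and shapes, not the library code itself): the buffer is rolled left by one observation width, and the newest observation is written into the freed last slot, which is exactly what the line you quoted does.

```python
import numpy as np

# Assumed shapes for illustration: n_envs = 1, obs_dim = 128, n_stack = 4,
# so the stacked buffer has shape (1, 512).
n_envs, obs_dim, n_stack = 1, 128, 4
stackedobs = np.zeros((n_envs, obs_dim * n_stack), dtype=np.float32)

def add_observation(observations):
    """Shift the buffer and append the newest observation on the last axis."""
    global stackedobs
    # Roll left by one observation width: the oldest frame falls off.
    stackedobs = np.roll(stackedobs, shift=-observations.shape[-1], axis=-1)
    # The quoted line: overwrite the last slot with the new observation.
    stackedobs[..., -observations.shape[-1]:] = observations
    return stackedobs

obs = np.random.randn(n_envs, obs_dim).astype(np.float32)
print(add_observation(obs).shape)  # (1, 512) == (1, 128 * 4)
```

So `stackedobs` always holds the last `n_stack` observations concatenated on the last axis; nothing is stored as a separate list of frames.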

Edit: The extra code is for handling terminal states correctly, where we have to update the terminal observation information in the `info` dict to match the frame stacking (note that with VecEnvs you do not normally receive the terminal observation, so we have to use the `info` dict to pass it forward).
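
Roughly, that bookkeeping looks like the following (a hedged, self-contained sketch with illustrative shapes, not a verbatim copy of the linked code):

```python
import numpy as np

# Illustrative setup: one env, 128-d observations, 4 stacked frames.
obs_dim = 128
stackedobs = np.random.randn(1, obs_dim * 4).astype(np.float32)
infos = [{"terminal_observation": np.random.randn(obs_dim).astype(np.float32)}]
dones = [True]

for i, done in enumerate(dones):
    if done:
        old_terminal = infos[i]["terminal_observation"]
        # Prepend the previous frames so the terminal observation is a
        # full stack rather than a single frame.
        infos[i]["terminal_observation"] = np.concatenate(
            (stackedobs[i, ..., :-obs_dim], old_terminal), axis=-1
        )
        # Zero this env's buffer so frames do not leak across episodes.
        stackedobs[i] = 0

print(infos[0]["terminal_observation"].shape)  # (512,)
```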

eliork commented 3 years ago

> This would mean your (1, 128) observations become (1, 128 * 4), but do note that this wrapper was not designed for non-image observations per se.

Thanks. I am trying to stack 4 observations together, each representing the latent space of a VAE encoder. My thought was to take the same idea behind frame stacking and apply it to the VAE latent space, reasoning that the latent code does represent an image. Or is my thought process completely wrong, given what you wrote about the wrapper not being designed for non-image observations?

Thank you for your detailed explanation.

araffin commented 3 years ago

Linking related issue: https://github.com/araffin/learning-to-drive-in-5-minutes/issues/36

Miffyli commented 3 years ago

@eliork I suggest you take a look at the link araffin provided (a very similar setup). Frame stacking can kind of work here: the latent codes of the four images are fed to the network, and it will happily process them. You will lose the temporal information, but the same happens with Atari's way of stacking frames. I expect this solution would work better than just feeding a single frame, but I cannot say for sure; you need to run experiments to see what works best for your setup.
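
In practice, wrapping your latent-producing env would look roughly like this (`LatentEnv` is a hypothetical stand-in for an env that emits 128-d VAE latents; only the wrapper calls are the actual stable-baselines API):

```python
import gym
import numpy as np
from stable_baselines.common.vec_env import DummyVecEnv, VecFrameStack

class LatentEnv(gym.Env):
    """Hypothetical env whose observations are 128-d VAE latent codes."""
    observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(128,), dtype=np.float32)
    action_space = gym.spaces.Discrete(2)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        return self.observation_space.sample(), 0.0, False, {}

# Stack the last 4 latents on the last axis: (1, 128) -> (1, 512).
venv = VecFrameStack(DummyVecEnv([LatentEnv]), n_stack=4)
print(venv.reset().shape)  # (1, 512)
```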

eliork commented 3 years ago

Will do, thank you very much!