guanfuchen / videopred

Common Video Prediction Architectures

Folded Recurrent Neural Networks for Future Video Prediction #6

Open guanfuchen opened 5 years ago

guanfuchen commented 5 years ago

related paper

Abstract
The main challenges in future video prediction are the high variability of videos, the temporal propagation of errors, and the non-specificity of future frames. This work introduces bijective Gated Recurrent Units (bGRU). Standard GRUs update a state, exposed as output, given an input. We extend them by treating the input as another recurrent state and updating it from the output with an extra set of gates. Stacking multiple such layers results in a recurrent auto-encoder: the operators updating the outputs form the encoder, while those updating the inputs form the decoder. Because the encoder and decoder share states, the representation is stratified during learning: some information is not passed on to the next layers. We show that only the encoder or the decoder needs to be applied for encoding or prediction, respectively. This reduces the computational cost and avoids re-encoding predictions when generating multiple frames, mitigating error propagation. Furthermore, layers can be removed from a trained model, giving insight into the role of each layer. Our approach improves state-of-the-art results on MMNIST and UCF101 and is competitive on KTH with 2 and 3 times less memory usage and computational cost than the best-scoring approach.
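
The bGRU idea described above can be pictured as two GRU-style gate sets sharing a pair of states: one set updates the "output" state from the "input" state (encoding), the other updates the "input" state from the "output" state (decoding). The sketch below is not the authors' implementation; it is a minimal, hypothetical PyTorch illustration of that double update, with the class names `ConvGRUGates` and `BGRUCell` and all hyperparameters chosen here for clarity only.

```python
import torch
import torch.nn as nn


class ConvGRUGates(nn.Module):
    """One set of convolutional GRU gates: updates state `s` given input `x`."""

    def __init__(self, in_channels, state_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # update (z) and reset (r) gates computed jointly
        self.gates = nn.Conv2d(in_channels + state_channels, 2 * state_channels,
                               kernel_size, padding=padding)
        # candidate state
        self.cand = nn.Conv2d(in_channels + state_channels, state_channels,
                              kernel_size, padding=padding)

    def forward(self, x, s):
        z, r = torch.sigmoid(self.gates(torch.cat([x, s], dim=1))).chunk(2, dim=1)
        s_tilde = torch.tanh(self.cand(torch.cat([x, r * s], dim=1)))
        return (1 - z) * s + z * s_tilde


class BGRUCell(nn.Module):
    """Illustrative bijective GRU layer: two shared states, two gate sets.

    `encode` updates the output state h from the input state x,
    `decode` updates the input state x from the output state h.
    Stacking such layers gives an encoder (all `encode` steps) and a
    decoder (all `decode` steps) that share their states.
    """

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.encoder_gates = ConvGRUGates(in_channels, out_channels, kernel_size)
        self.decoder_gates = ConvGRUGates(out_channels, in_channels, kernel_size)

    def encode(self, x, h):
        # forward pass while observing frames: only the encoder gates are applied
        return self.encoder_gates(x, h)

    def decode(self, h, x):
        # backward pass while predicting frames: only the decoder gates are applied
        return self.decoder_gates(h, x)
```

Under this reading, encoding a sequence only runs `encode` through the stack, and generating future frames only runs `decode` from the topmost state downward, which is how the paper avoids re-encoding its own predictions when rolling out multiple frames.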