Mathieu et al 2016 showed that a GAN-based loss produces sharper generations than L2-based losses. (torch implementation here)
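A minimal sketch of the idea (not Mathieu et al's exact objective): combine a pixel-wise L2 term with an adversarial term so the predictor is penalised for the blurry "average" frames that pure L2 favours. The names `disc` and `lambda_adv` and the weighting are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def predictor_loss(pred_frame, target_frame, disc, lambda_adv=0.05):
    # Pixel-wise L2 term: accurate on average, but tends to blur details.
    l2 = F.mse_loss(pred_frame, target_frame)
    # Adversarial term: reward predictions the discriminator scores as real,
    # pushing the predictor toward sharp, plausible frames.
    logits = disc(pred_frame)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return l2 + lambda_adv * adv
```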
Denton and Fergus 2018 present an unsupervised video generation model that learns a prior over the uncertainty in a given environment. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. Relevant because we do want to assign different priors to regions near and far from the equator.
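A rough sketch of that sampling step, assuming a Gaussian learned prior; the module names (`encoder`, `decoder`), dimensions, and layout are hypothetical and not Denton and Fergus's architecture.

```python
import torch
import torch.nn as nn

class LearnedPrior(nn.Module):
    """Maps a summary of past frames to a Gaussian over the latent z_t."""
    def __init__(self, h_dim=128, z_dim=16):
        super().__init__()
        self.to_mu = nn.Linear(h_dim, z_dim)
        self.to_logvar = nn.Linear(h_dim, z_dim)

    def forward(self, h_past):
        mu, logvar = self.to_mu(h_past), self.to_logvar(h_past)
        # Reparameterised sample from the learned prior.
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

# Generation time (encoder/decoder are assumed to exist elsewhere):
# h_past = encoder(past_frames)      # deterministic summary of the past
# z = prior(h_past)                  # stochastic component from the prior
# next_frame = decoder(h_past, z)    # combine deterministic and sampled parts
```

Conditioning the prior network on latitude as well as on past frames would be one way to let regions near and far from the equator carry different priors.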
Reda et al 2018 present a video prediction model that conditions on both past frames and optical flows.