google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0

What if the observation is extracted features instead of images and has a much smaller dimension than the latent? #59

Open seheevic opened 4 years ago

seheevic commented 4 years ago

Hi! I'm not sure you still do Q&A support here :blush:, but I'm stuck on a problem beyond my math skills. I hope you can help me.

The question concerns the loss function of your RSSM, which takes a variational approach. The reconstruction term of the VAE is p(o_t|s_t), since the decoder maps from the latent state to the image. In that setting the observation (an image) has a much higher dimension than the latent. But consider the opposite case, where o_t has a much smaller dimension than the latent, for example the 4 values of CartPole in OpenAI Gym's classic_control versus a latent of, say, 32-64. Then I think p(o_t|s_t) could not learn any meaningful distribution: because s_t is sampled from the variational posterior q(s_t|a_1:t, o_1:t), which has already seen the current observation o_t, I suspect s_t could simply learn to copy all of o_t into itself, since s_t has far more dimensions than it needs.
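To be concrete, the bound I have in mind is the one below. I'm paraphrasing Eq. 3 of the paper from memory, so please correct me if I misstate it:

```latex
% One-step variational bound (paraphrasing Eq. 3 of the PlaNet paper):
% reconstruction term minus the KL between the posterior and the learned prior.
\ln p(o_{1:T} \mid a_{1:T}) \;\geq\; \sum_{t=1}^{T} \Big(
  \mathbb{E}_{q(s_t \mid o_{\leq t}, a_{<t})}\!\big[\ln p(o_t \mid s_t)\big]
  \;-\; \mathbb{E}_{q(s_{t-1})}\!\big[\operatorname{KL}\!\big[
      q(s_t \mid o_{\leq t}, a_{<t}) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1})
  \big]\big] \Big)
```

The first term is the reconstruction loss I'm worried about; as far as I can tell, the KL term is the only pressure against s_t simply memorizing o_t.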

In this situation (non-image observations of small dimension), can we still use this VAE-like approach? Or is there another technique that is more reasonable in this case? I hope this worry makes sense to you. :confused:

abrandenb commented 4 years ago

Since the autoencoder is used for dimensionality reduction (in the default configs, from 64x64x3 = 12288 dimensions down to around 500), I would not apply it in the scenario you describe. If you have a low-dimensional input, you can skip the autoencoder, since it wouldn't give you any gain. I assume you can still learn the latent dynamics model and the reward model and then apply MPC, just as PlaNet would do if you scrap the VAE.
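Roughly like this, as a minimal sketch. This is PyTorch for illustration, not the repo's TensorFlow code, and every name and size in it (`LowDimDynamics`, `obs_dim=4`, the GRU cell, the regression losses) is made up for the example rather than taken from PlaNet:

```python
# Minimal sketch: no image decoder, just a deterministic GRU dynamics
# model plus a reward head on low-dimensional observations.
import torch
import torch.nn as nn

class LowDimDynamics(nn.Module):
    def __init__(self, obs_dim=4, act_dim=1, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)      # tiny encoder, no conv stack
        self.cell = nn.GRUCell(hidden + act_dim, hidden)
        self.reward_head = nn.Linear(hidden, 1)        # predicts r_t from the state
        self.obs_head = nn.Linear(hidden, obs_dim)     # predicts o_{t+1} from the state

    def forward(self, obs, actions):
        # obs: (T, B, obs_dim), actions: (T, B, act_dim)
        h = torch.zeros(obs.size(1), self.cell.hidden_size, device=obs.device)
        rewards, next_obs = [], []
        for t in range(obs.size(0)):
            inp = torch.cat([self.encoder(obs[t]), actions[t]], dim=-1)
            h = self.cell(inp, h)
            rewards.append(self.reward_head(h))
            next_obs.append(self.obs_head(h))
        return torch.stack(rewards), torch.stack(next_obs)

# Training: simple regression losses instead of the variational bound.
model = LowDimDynamics()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
T, B = 10, 32
obs = torch.randn(T, B, 4)            # stand-in for a replay-buffer batch
actions = torch.randn(T, B, 1)
true_rewards = torch.randn(T, B, 1)
pred_r, pred_o = model(obs, actions)
loss = ((pred_r - true_rewards) ** 2).mean() \
     + ((pred_o[:-1] - obs[1:]) ** 2).mean()  # one-step observation prediction
opt.zero_grad()
loss.backward()
opt.step()
```

With a 4-dimensional observation there is nothing for a VAE to compress, so plain regression losses on the next observation and the reward should be enough to train a model you can then plan against with MPC (e.g. CEM, as PlaNet does).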