phlippe / BISCUIT

Official code of the paper "BISCUIT: Causal Representation Learning from Binary Interactions" (UAI 2023)
https://phlippe.github.io/BISCUIT/

Some questions about the paper and code. #3

Closed 945716994 closed 1 year ago

945716994 commented 1 year ago
  1. Why do you train the decoder (in the AE) using the action? I understand $z_t = encoder(x_t)$, but why $x_t = decoder(z_t, a_t)$?
  2. Can I understand the regime variable $R^t$ in your paper as an action variable $a_{t-1}$ in RL?
  3. What is the effect of the 'mutual information regularization' in the AE training process? I couldn't find a sentence explaining it in your paper.

Looking forward to your reply!!

phlippe commented 1 year ago

Hi, thanks for your questions!

  1. This is only needed when the regime, $R^t$, also influences the image (see the code sketch after this list). In the experiments of the paper, this was only the case for CausalWorld, where $R^t$ is the arm positions/motor angles of the tri-finger and the robot itself was visible in the image. For other datasets, no action was used in the decoder.
  2. Yes, the action variable can be one example of a regime. There are more possible settings, e.g. environment properties or agent states.
  3. We ended up not using it, so you can ignore it. It was a potential way of further separating the robot position in CausalWorld from the latents $z^t$, although it ended up neither being needed nor providing improvements.
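
To make point 1 concrete, here is a minimal sketch of a decoder that is optionally conditioned on the regime/action. All names and layer sizes are made up for illustration (e.g. `ConditionedAutoencoder`, `use_action_in_decoder`); they are not the classes used in this repo:

```python
# Minimal sketch (hypothetical names/sizes): an autoencoder whose decoder can
# optionally be conditioned on the regime/action, as in the CausalWorld setup
# where the robot arm is visible in the image.
import torch
import torch.nn as nn

class ConditionedAutoencoder(nn.Module):
    def __init__(self, x_dim=256, z_dim=8, action_dim=9, use_action_in_decoder=True):
        super().__init__()
        self.use_action_in_decoder = use_action_in_decoder
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        dec_in = z_dim + (action_dim if use_action_in_decoder else 0)
        self.decoder = nn.Sequential(nn.Linear(dec_in, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def forward(self, x, action=None):
        z = self.encoder(x)                  # z^t = encoder(x^t)
        if self.use_action_in_decoder:
            z = torch.cat([z, action], dim=-1)   # decoder additionally sees R^t (e.g. motor angles)
        return self.decoder(z)               # x^t ≈ decoder(z^t [, R^t])
```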
945716994 commented 1 year ago

Thanks for your reply! I still have some confusion about question 1. You said "This is only needed when the regime, $R^t$, also influences the image." Suppose I have an image $x^{t-1}$ and I take an action $R^{t-1}$ (e.g. moving the robot arm in CausalWorld or picking up an object in AI2-THOR), and then I get the new image $x^t$. In this situation, the regime influences the image at the next time step, so when we train the autoencoder, $z^{t-1} = encoder(x^{t-1})$, but does the reconstructed image $x'^{t-1}$ from the decoder need to take $R^{t-1}$ into account?

The core of my confusion is this: $R^t$ affects the generation of the image at time $t$, so the $z^t$ obtained from the encoder already contains the information of $R^t$; why, then, does $R^t$ still need to be given to the decoder?

phlippe commented 1 year ago

Sorry for the confusion; I distinguish here between implicit/indirect and explicit/direct influence on the image. Most environments only have an 'implicit'/'indirect' effect, meaning that the action impacts how the causal factors change, but only the causal factors are visualized. In other words, the image only shows the causal variables at a time point. In CausalWorld, the action/regime has an 'explicit', or 'direct', effect on the image: the causal variables alone cannot explain the image anymore, because the robotic state itself can be seen in the image. Hence, a plain autoencoder would place $R^t$ in the latent space as well. However, we do not need $R^t$ in the latent space since (a) it is observed, and (b) it is not a causal variable of interest. Thus, we give it explicitly to the decoder to remove the need for placing $R^t$ in the latent space.
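
As a rough illustration of why this removes the pressure to encode $R^t$ in the latents: since the decoder already receives $R^t$, minimizing the reconstruction loss does not require the encoder to store it in $z^t$. The training step below is a hypothetical sketch (not the actual training loop of this repo), assuming a model with the interface sketched above:

```python
# Hypothetical reconstruction step: because R^t is fed to the decoder directly,
# the loss can be minimized without z^t having to explain the visible robot arm.
import torch.nn.functional as F

def reconstruction_step(model, optimizer, x_t, regime_t):
    x_rec = model(x_t, action=regime_t)   # decoder sees (z^t, R^t)
    loss = F.mse_loss(x_rec, x_t)         # a plain AE would have to explain the robot arm from z^t alone
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```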

945716994 commented 1 year ago

Thank you very much for resolving my confusion! I have one last question. Assuming I have completed training the model, how can I select the parts of the latent variable $z$ that correspond to the causal variables $C$ and visualize them through the decoder? (For example, one part of $z$ may learn the causal variable representing the background, while another part of $z$ may represent an object.) Could you provide a demo? Thank you very much!

phlippe commented 1 year ago

Depending on the data you have available, you can either

  1. Check with a few labeled samples which latent variables correlate with each causal variable, as done for the evaluations in the paper (see the sketch below);
  2. Check which actions lead to the predicted interaction variables being 1, which gives you an indication of which causal variable may be intervened on at that time;
  3. or simply perform the triplet evaluation, as shown in the demo for example, to identify which latent variables have what effect on the image.
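
For option 1, a minimal sketch of such a correlation check could look like the following. Array names and shapes are assumptions, and raw Pearson correlation is only a cheap proxy for the more thorough evaluation used in the paper:

```python
# Sketch of option 1: with a few labeled samples, check which latent dimensions
# correlate with each ground-truth causal variable.
# Assumed shapes: z is (N, z_dim) from the encoder,
#                 c is (N, num_causal_vars) of ground-truth causal variable values.
import numpy as np

def latent_causal_correlation(z, c, eps=1e-8):
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)   # standardize latents
    c = (c - c.mean(axis=0)) / (c.std(axis=0) + eps)   # standardize causal labels
    return z.T @ c / z.shape[0]   # (z_dim, num_causal_vars) Pearson correlations

# Latent dimensions with high |correlation| for a causal variable are the ones
# to traverse/decode when visualizing that variable.
```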