Thanks for your great work and code! I noticed that at training time in most environments, only data collected with z sampled from the prior is added to the encoder buffer --- num_steps_posterior is set to zero for these environments. What's the reasoning behind this decision? Why not also include data collected with z sampled from the posterior in the encoder buffer?
We found this setting worked better for these shaped reward environments, in which exploration doesn't seem to be crucial for identifying and solving the task.
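For anyone else reading this thread, here is a toy sketch of how these two settings interact during collection. All names (`Sampler`, `collect_iteration`, the buffer lists) are illustrative stand-ins, not the repo's actual API; the point is just that with `num_steps_posterior = 0`, nothing collected under the posterior ever reaches the encoder buffer:

```python
class Sampler:
    """Toy stand-in: "collects" n transitions, tagged with how z was sampled."""
    def collect(self, n, z_source):
        return [{"z_source": z_source, "step": i} for i in range(n)]

def collect_iteration(sampler, encoder_buffer, rl_buffer,
                      num_steps_prior=400, num_steps_posterior=0):
    # Steps gathered with z ~ prior: added to both the RL buffer and
    # the encoder buffer.
    prior_data = sampler.collect(num_steps_prior, z_source="prior")
    rl_buffer.extend(prior_data)
    encoder_buffer.extend(prior_data)

    # Steps gathered with z ~ posterior: with num_steps_posterior = 0
    # (the setting asked about), this collects nothing, so the encoder
    # buffer only ever sees prior-sampled data.
    posterior_data = sampler.collect(num_steps_posterior, z_source="posterior")
    rl_buffer.extend(posterior_data)
    encoder_buffer.extend(posterior_data)

encoder_buffer, rl_buffer = [], []
collect_iteration(Sampler(), encoder_buffer, rl_buffer)
```

After one iteration with the defaults above, every transition in `encoder_buffer` is prior-sampled, which is the behavior the question describes.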