Why predict the current step's observation and BEV segmentation? Why not directly predict the observation and segmentation of the next step?
If so, imagination was not needed, one can directly use the prediction of image as the input for the next step. The model architecture would be simpler.
Why predict the current step's observation and BEV segmentation? Why not directly predict the observation and segmentation of the next step? If so, imagination was not needed, one can directly use the prediction of image as the input for the next step. The model architecture would be simpler.