Dear author,

In the training of the Interpolation Transformer, given that the latent space is 5 × 16 × 16, I found that the first 16 × 16 and the last 16 × 16 tokens participate in gradient propagation. But during inference with the Interpolation Transformer, the first and last 16 × 16 tokens are given. So, in my opinion, the first 16 × 16 and the last 16 × 16 tokens should not take part in gradient back-propagation during training. Please correct me if I'm wrong.

Kang
Hi Kang, good point. I think you are right. I suspect that if you mask out the loss on the initial and last frames, it shouldn't affect the model performance.
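For reference, here is a minimal sketch of what masking out the loss on the first and last frames could look like, assuming a PyTorch setup with a per-token cross-entropy loss over the 5 × 16 × 16 latent grid. The function name, tensor shapes, and argument names are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def interpolation_loss(logits, targets):
    """Hypothetical training loss for the Interpolation Transformer.

    logits:  (B, T, H, W, V) per-token logits, with T=5, H=W=16 assumed.
    targets: (B, T, H, W) ground-truth token ids.
    """
    B, T, H, W, V = logits.shape

    # Per-token cross-entropy, kept unreduced so we can mask it.
    per_token = F.cross_entropy(
        logits.reshape(-1, V), targets.reshape(-1), reduction="none"
    ).view(B, T, H, W)

    # Zero out the first and last frames: their 16x16 tokens are given at
    # inference time, so they would not contribute gradients here.
    mask = torch.ones(B, T, H, W, device=logits.device)
    mask[:, 0] = 0.0
    mask[:, -1] = 0.0

    # Average only over the intermediate frames.
    return (per_token * mask).sum() / mask.sum()
```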