As mentioned in the paper, the whole video clip can be fed into the network at inference time because no position encoding is added.
What if the input is a never-ending video stream? Could ActionFormer be adapted for that case, for example by adding position encoding, and would it still work?
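For context, one common workaround that needs no architectural change is sliding-window inference: buffer the incoming feature stream, run the model on fixed-length overlapping windows, and shift the clip-relative timestamps back to absolute stream time. The sketch below is a minimal illustration under stated assumptions, not ActionFormer's actual API: `model`, `feature_stream`, the detection dict keys, and the `WINDOW`/`STRIDE` values are all hypothetical placeholders.

```python
from collections import deque

import torch

WINDOW = 2304   # clip length in feature frames (hypothetical value)
STRIDE = 1152   # hop between consecutive windows (hypothetical value)


def stream_inference(model, feature_stream):
    """Run a model trained on fixed-length clips over an unbounded stream.

    Assumptions (not ActionFormer's real interface):
      - `feature_stream` yields per-frame feature tensors of shape (C,)
      - `model` takes a (C, WINDOW) clip and returns an iterable of dicts
        with clip-relative "start"/"end" frame indices and a "label"
    """
    buffer = deque(maxlen=WINDOW)  # keeps only the most recent WINDOW frames
    n = 0                          # total frames consumed so far

    for feat in feature_stream:
        buffer.append(feat)
        n += 1
        # Fire a window once the buffer is full, then every STRIDE frames.
        if len(buffer) == WINDOW and (n - WINDOW) % STRIDE == 0:
            clip = torch.stack(list(buffer), dim=-1)  # (C, WINDOW)
            offset = n - WINDOW  # absolute index of the window's first frame
            for det in model(clip):
                # Shift clip-relative timestamps to absolute stream time.
                yield det["start"] + offset, det["end"] + offset, det["label"]
```

Because consecutive windows overlap, the same action may be detected more than once, so some cross-window suppression (e.g. NMS over the absolute timestamps) would still be needed; whether this matches the fixed-clip accuracy, or whether adding position encoding helps instead, is exactly the open question here.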