abdksyed opened this issue 3 days ago
It is for curriculum learning: first, from easy to hard cases, then anneal back to 5, which is closer to what is used during inference.
Note that max jump sets the maximum. We don't always sample at the maximum.
See https://arxiv.org/pdf/2103.07941 https://davischallenge.org/challenge2020/papers/DAVIS-Semisupervised-Challenge-1st-Team.pdf
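To make the schedule concrete, here is a minimal sketch of such a curriculum sampler. The breakpoints and function names (`SCHEDULE`, `current_max_skip`, `sample_clip`) are illustrative assumptions, not the repo's actual config; the pattern (10 → 15 → 5 → 5) and the fact that the max skip is only a cap on a uniformly sampled gap follow the explanation above.

```python
import random

# Illustrative curriculum: (training-progress fraction, max frame skip).
# These breakpoints are an assumption for illustration; the pattern
# mirrors the schedule discussed above (10 -> 15 -> anneal back to 5).
SCHEDULE = [(0.0, 10), (0.1, 15), (0.8, 5), (0.9, 5)]

def current_max_skip(progress):
    """Max skip in effect at a given training-progress fraction in [0, 1]."""
    max_skip = SCHEDULE[0][1]
    for start, skip in SCHEDULE:
        if progress >= start:
            max_skip = skip
    return max_skip

def sample_clip(num_video_frames, clip_len, progress, rng=random):
    """Sample `clip_len` frame indices from a video.

    Each gap is drawn uniformly from [1, max_skip], so the max skip is a
    cap, not the typical gap -- we don't always sample at the maximum.
    """
    max_skip = current_max_skip(progress)
    # Leave enough room for the worst case: (clip_len - 1) maximal gaps.
    start = rng.randrange(num_video_frames - (clip_len - 1) * max_skip)
    indices = [start]
    for _ in range(clip_len - 1):
        indices.append(indices[-1] + rng.randint(1, max_skip))
    return indices
```

Early in training the clips are temporally dense (easy); mid-training they can span larger gaps (hard); near the end the cap anneals back to 5, matching the memory-read interval typically used at inference time.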
Thanks.
Can you also give some intuition on the second part: how does training on 8-frame videos with at most 3 frames in memory lead to better long-video segmentation during inference?
Another question I had: since the number of frames used for training was 8 as mentioned in the paper, how is the model able to do well on long videos where the frame count can be in the thousands, when it has never been trained with such long-form memory usage? Even with a max jump of 15, the largest frame span within a single training video would be 15*8 = 120 frames.
It generalizes. It is not unlike how CNN generalizes to different resolutions and how LLM generalizes to different sequence lengths with relative position embeddings. Learning a robust appearance representation (as queries/keys) is enough to go a long way. It might not be optimal -- but we don't really have sufficiently long video datasets at the time.
I wanted to know the idea behind having different `max_skip_values`. The value starts at 10, increases to 15, and then drops back to 5, 5. Is there any intuition or reason for doing this?