hkchengrex / XMem

[ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
https://hkchengrex.com/XMem/
MIT License

Why do we have different/decreasing `skip_values` as we progress in stage 03 training #145

Open abdksyed opened 3 days ago

abdksyed commented 3 days ago

I wanted to know the idea behind having different `max_skip_values`: the value starts at 10, increases to 15, and then drops back to 5, 5.

Is there any intuition or reasoning behind doing this?

Another question I had: since the number of frames used for training was 8, as mentioned in the paper, how is the model able to do well on long videos where the frame count can be in the thousands, when it has never been trained for such long-term memory usage? Even with a max jump of 15, the largest frame difference within a single training video would be 15*8 = 120 frames.

hkchengrex commented 3 days ago

It is for curriculum learning: first go from easy to hard cases, then anneal back to 5, which is closer to what is used during inference.

Note that max jump sets the maximum. We don't always sample at the maximum.

See https://arxiv.org/pdf/2103.07941 https://davischallenge.org/challenge2020/papers/DAVIS-Semisupervised-Challenge-1st-Team.pdf
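
To make the sampling concrete, here is a minimal sketch of curriculum frame sampling with a bounded random skip. The schedule values, function names (`current_max_skip`, `sample_frame_indices`), and clip length are hypothetical and only illustrate the 10 → 15 → 5, 5 annealing discussed above; the real schedule lives in the training configuration, not in this snippet.

```python
import random

# Hypothetical curriculum schedule: (training progress in [0, 1], max_skip).
# Illustrates the "start at 10, raise to 15, anneal back to 5, 5" idea.
MAX_SKIP_SCHEDULE = [(0.0, 10), (0.1, 15), (0.7, 5), (0.9, 5)]

def current_max_skip(progress):
    """Return the max_skip in effect at a given training progress in [0, 1]."""
    max_skip = MAX_SKIP_SCHEDULE[0][1]
    for start, value in MAX_SKIP_SCHEDULE:
        if progress >= start:
            max_skip = value
    return max_skip

def sample_frame_indices(num_video_frames, num_train_frames, progress, rng=random):
    """Pick `num_train_frames` indices whose consecutive gaps are each drawn
    uniformly from [1, max_skip] -- max_skip is only an upper bound, so the
    maximum gap is not always used."""
    max_skip = current_max_skip(progress)
    indices = [rng.randrange(num_video_frames)]
    for _ in range(num_train_frames - 1):
        gap = rng.randint(1, max_skip)
        # Clamp at the last frame for simplicity.
        indices.append(min(indices[-1] + gap, num_video_frames - 1))
    return indices

# Example: sample an 8-frame training clip late in training (max_skip = 5).
print(sample_frame_indices(num_video_frames=100, num_train_frames=8, progress=0.95))
```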

abdksyed commented 1 day ago

Thanks.

Can you also give some intuition on the second part: how does training on 8-frame videos, with at most 3 frames in memory, lead to good long-video segmentation capability during inference?


hkchengrex commented 17 hours ago

It generalizes. It is not unlike how a CNN generalizes to different resolutions, or how an LLM with relative position embeddings generalizes to different sequence lengths. Learning a robust appearance representation (as queries/keys) is enough to go a long way. It might not be optimal -- but we didn't really have sufficiently long video datasets at the time.
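
One way to see why this can generalize: nothing in a query/key memory readout is tied to the training clip length. Below is a toy sketch (not the repository's actual code) of a dot-product attention read over a variable-size memory; the function name, tensor shapes, and channel sizes are made up for illustration.

```python
import torch

def memory_readout(query, mem_keys, mem_values):
    """Toy dot-product memory read: the softmax over memory positions works
    for any number N of stored frames, so the readout itself does not care
    whether N matches what was seen during training.

    query:      (B, C_k, HW)      -- keys of the current frame
    mem_keys:   (B, C_k, N * HW)  -- keys of N memorized frames
    mem_values: (B, C_v, N * HW)  -- values of N memorized frames
    """
    affinity = torch.einsum('bcq,bcm->bqm', query, mem_keys)      # (B, HW, N*HW)
    affinity = torch.softmax(affinity / query.shape[1] ** 0.5, dim=-1)
    return torch.einsum('bqm,bcm->bcq', affinity, mem_values)     # (B, C_v, HW)

# The same weights handle a training-like memory (N=3) and a much longer one (N=300).
q = torch.randn(1, 64, 30 * 30)
for n_frames in (3, 300):
    out = memory_readout(q,
                         torch.randn(1, 64, n_frames * 900),
                         torch.randn(1, 512, n_frames * 900))
    print(n_frames, out.shape)  # torch.Size([1, 512, 900]) in both cases
```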