hkchengrex / XMem

[ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
https://hkchengrex.com/XMem/
MIT License

Question about the algorithm and training procedure #136

Closed: zzzc18 closed this issue 7 months ago

zzzc18 commented 8 months ago

Hi Cheng,

I'm new to the VOS area, and after reading the paper I still have two questions about the algorithm.

  1. Does the readout in XMem (and in Cutie) turn the VOS task from learning an $\text{img}\rightarrow\text{mask}$ map into learning a $\text{similar img}\rightarrow\text{similar mask}\rightarrow\text{mask}$ map through the retrieval process (at the local feature level)?
  2. Is the long-term memory module not involved in training, i.e., is it used only at test time? As you state in the paper, the training sequences have length eight, which is smaller than $T_{max}=10$.

Thank you for taking the time to read this issue. I greatly appreciate any advice you can provide.

hkchengrex commented 7 months ago

  1. That is more about STCN. I think you can look at it that way at a high level.
  2. It is used at test time only.

zzzc18 commented 7 months ago

Thank you for your reply!
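
For readers following question 1, the "similar img → similar mask → mask" view can be illustrated with a minimal sketch of an attention-style memory readout. This is not the repository's code: the function name `readout`, the toy vectors, and the use of plain Python lists are all assumptions made for illustration. The similarity is a negative squared L2 distance, which is the STCN-style choice; XMem's actual similarity has additional terms that are omitted here.

```python
import math


def readout(query_keys, memory_keys, memory_values):
    """Soft nearest-neighbour readout (hypothetical helper, illustration only).

    query_keys:    list of Q key vectors, one per query location
    memory_keys:   list of M key vectors, one per memory location
    memory_values: list of M value vectors (mask-derived features)
    Returns a list of Q value vectors.
    """
    dim = len(memory_values[0])
    out = []
    for q in query_keys:
        # Similarity: negative squared L2 distance to every memory key.
        sims = [-sum((a - b) ** 2 for a, b in zip(q, k)) for k in memory_keys]
        # Softmax over memory locations, stabilised by subtracting the max.
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Readout: similarity-weighted sum of the memory values.
        out.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(dim)]
        )
    return out


# Toy example: two memory locations with one-hot "mask" values.
memory_keys = [[0.0, 0.0], [10.0, 10.0]]
memory_values = [[1.0, 0.0], [0.0, 1.0]]
# The query key is close to the first memory key.
result = readout([[0.1, -0.1]], memory_keys, memory_values)
```

In this toy case the query location looks like the first memory location, so the retrieved value is dominated by that location's value. A decoder would then turn the retrieved features into the final mask, which is the last arrow in the question's mapping.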