hkchengrex / Cutie

[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
MIT License

Why is max reserved memory much higher than max allocated memory? #112

Closed qinliuliuqin closed 3 weeks ago

qinliuliuqin commented 3 weeks ago

Hi Rex,

Thanks for your significant contribution to VOS! When testing your Cutie-base model on LVOS-val, I found that PyTorch needed to cache a large amount of memory, as shown below.

torch.cuda.max_memory_allocated() # 1092M
torch.cuda.max_memory_reserved() # 11302M
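For readers who want to reproduce this kind of measurement, the two counters can be read roughly as follows. This is a minimal sketch, not code from the Cutie repository: the helper name `peak_memory_mib` is hypothetical, and reporting in MiB (2^20 bytes) is an assumption about what "M" means above.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None

MIB = 1024 ** 2  # assumption: "M" in the figures above means MiB


def peak_memory_mib(device=None):
    """Return peak allocated/reserved CUDA memory in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        # Peak memory occupied by live tensors.
        "allocated": torch.cuda.max_memory_allocated(device) / MIB,
        # Peak memory held by the caching allocator (live tensors plus cached blocks).
        "reserved": torch.cuda.max_memory_reserved(device) / MIB,
    }
```

Calling `torch.cuda.reset_peak_memory_stats()` before the run keeps earlier work from inflating the peaks.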

Do you have any insights about the cause of such large cached memory? It must be related to the memory management, but I have no idea which operations are the exact cause. Thank you in advance for any explanation.

Best, Qin

hkchengrex commented 3 weeks ago

This is managed by PyTorch, not us. See

qinliuliuqin commented 3 weeks ago

Thanks for your prompt reply. Yes, PyTorch does all the GPU memory management, but why does it need to cache so much for Cutie? It must be related to Cutie's implementation. I developed a recurrent version of Cutie (primarily by replacing the softmax-attention matching with linear-attention matching), and got significantly reduced cached memory usage.

torch.cuda.max_memory_allocated() # 523M
torch.cuda.max_memory_reserved() # 1046M

If PyTorch needs to cache so much for Cutie, it may be very slow to run Cutie on low-end devices with limited memory. I may be wrong on this, and I appreciate any thoughts you may have.

hkchengrex commented 3 weeks ago

If max_memory_reserved decreased after only changing the attention mechanism, it must be caused by the attention, right? I am not sufficiently familiar with how PyTorch decides to take more cache, so unfortunately I do not know the reason. However, I believe that this reserved memory is never used (otherwise it would show up as max_allocated), so it might not cause problems for devices with limited memory.
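The thread does not establish the mechanism. One way a caching allocator can end up reserving far more than is ever live at once is a buffer that grows a little on every step: each freed block is slightly too small for the next request, so fresh memory is reserved each time. The toy model below illustrates only that idea. It is not PyTorch's actual allocator (which also splits and merges blocks and can release cached memory under pressure), and whether this pattern occurs in Cutie is an assumption, not something confirmed here. The function name is hypothetical.

```python
def simulate_growing_buffer(steps, unit=1):
    """Toy caching allocator driven by a buffer that grows by `unit` each step.

    Returns (peak_allocated, reserved) in the same units as `unit`.
    """
    cached = []          # sizes of freed blocks the allocator keeps around
    reserved = 0         # total memory requested from the device so far
    current = 0          # size of the block holding the live buffer
    peak_allocated = 0

    for t in range(1, steps + 1):
        need = t * unit
        # Reuse the smallest cached block that is large enough, if any.
        fits = [b for b in cached if b >= need]
        if fits:
            block = min(fits)
            cached.remove(block)
        else:
            block = need
            reserved += block  # nothing fits: reserve fresh memory
        # Old and new buffers coexist while the data is copied over.
        peak_allocated = max(peak_allocated, current + block)
        if current:
            cached.append(current)  # old block is freed but stays cached
        current = block
    return peak_allocated, reserved
```

In this model the reserved total grows quadratically with the number of steps while the peak allocation grows only linearly, which is the qualitative gap reported in this issue.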

qinliuliuqin commented 3 weeks ago

Yes, if the reserved memory is never used, then it wouldn't be an issue. I agree with you that softmax attention may be the real cause. I will let you know if I have new findings on this issue.

qinliuliuqin commented 3 weeks ago

Adding torch.cuda.empty_cache() before processor.step() can resolve this issue. Now I get:

torch.cuda.max_memory_allocated() # 1094M
torch.cuda.max_memory_reserved() # 1530M
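A sketch of the workaround described above, for anyone who wants to try it. The wrapper name `step_with_cache_release` is hypothetical, and the arguments to `processor.step()` are not shown in this thread, so the step function is passed in as a callable rather than called with a guessed signature.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def step_with_cache_release(step_fn, *args, **kwargs):
    """Release unused cached CUDA blocks (when CUDA is usable), then run step_fn."""
    if torch is not None and torch.cuda.is_available():
        # Returns cached blocks that hold no live tensors to the driver.
        torch.cuda.empty_cache()
    return step_fn(*args, **kwargs)
```

It would be used as `step_with_cache_release(processor.step, ...)` with whatever arguments the step normally takes. Emptying the cache every frame forces the allocator to request memory again on later steps, so this may cost some speed.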

I guess this is because PyTorch by default will automatically cache the keys and values of a sequence in softmax attention.

# T=1700, max number of frames in a video
# H, W=480, 853 (all video frames in LVOS val are of the same spatial resolution)
# N=4, max number of objects in a video
# Cv=256, value dimension
# Ck=64, key dimension
# f32 = 4, float32 bytes
max_reserved_mem = T x H/16 x W/16 x (N x Cv + Ck) x f32 bytes ~= 11285M (in MiB, using the values listed above)
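The estimate can be checked by evaluating the expression directly. The sketch below plugs in the values listed above; the function name is hypothetical, and two choices are assumptions: H/16 and W/16 are left unrounded, and the result is reported in MiB (2^20 bytes). Different rounding or units will shift the figure somewhat.

```python
def kv_cache_mib(T=1700, H=480, W=853, N=4, Cv=256, Ck=64,
                 bytes_per_el=4, stride=16):
    """Size in MiB of keys plus per-object values stored for every frame."""
    positions = T * (H / stride) * (W / stride)  # spatial positions over all frames
    channels = N * Cv + Ck                       # per-object values plus shared keys
    return positions * channels * bytes_per_el / 2 ** 20
```

Calling `kv_cache_mib()` with no arguments uses the LVOS-val numbers from the comment above; passing a smaller `T` shows how the estimate scales with the number of stored frames.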

Caching these keys and values is efficient for parallel training/inference of softmax self-attention. However, for softmax matching in memory-based VOS, we are doing cross-attention, not self-attention. In other words, we don't need to cache these keys and values for re-use. This explanation makes sense to me, though I could be wrong.