This is managed by PyTorch, not us. See https://pytorch.org/docs/stable/notes/cuda.html#memory-management
Thanks for your prompt reply. Yes, PyTorch does all the GPU memory management, but why does it need to cache so much for Cutie? It must be related to Cutie's implementation. I developed a recurrent version of Cutie (primarily by replacing the softmax-attention matching with linear-attention matching) and observed significantly reduced cached memory usage:
torch.cuda.max_memory_allocated() # 523M
torch.cuda.max_memory_reserved() # 1046M
If PyTorch needs to cache so much for Cutie, it may be very slow to run Cutie on low-end devices with limited memory. I may be wrong on this, and I appreciate any thoughts you may have.
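For reference, here is a generic toy sketch of the difference between softmax matching and a linear-attention variant (my own illustration, not the actual code of either model; the feature map, shapes, and sizes are assumptions):

import torch
import torch.nn.functional as F

def softmax_matching(q, k, v):
    # q: (P, Ck) query pixels; k: (M, Ck), v: (M, Cv) memory bank.
    # The full (P, M) affinity matrix is materialized, so temporaries grow with memory size.
    affinity = q @ k.t() / k.shape[-1] ** 0.5
    return F.softmax(affinity, dim=-1) @ v            # (P, Cv)

def linear_matching(q, k, v, eps=1e-6):
    # Kernelized (linear) attention: the memory is summarized into a (Ck, Cv) state,
    # so no (P, M) affinity matrix is ever built.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1         # a common positive feature map
    kv = phi_k.t() @ v                                # (Ck, Cv)
    z = phi_k.sum(dim=0)                              # (Ck,)
    return (phi_q @ kv) / (phi_q @ z).clamp_min(eps).unsqueeze(-1)

P, M, Ck, Cv = 1024, 4096, 64, 256
q, k, v = torch.randn(P, Ck), torch.randn(M, Ck), torch.randn(M, Cv)
print(softmax_matching(q, k, v).shape, linear_matching(q, k, v).shape)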
If max_memory_reserved decreased after only changing the attention mechanism, it must be caused by the attention, right? I am not sufficiently familiar with how PyTorch decides to reserve more cache, so unfortunately I do not know the reason. However, I believe this reserved memory is never actually used (otherwise it would show up as max_allocated) -- so it might not cause problems on devices with limited memory.
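A quick way to check how much of the reserved pool actually backs live tensors (a generic snippet assuming a CUDA device, not part of Cutie):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run inference on one video here ...
alloc = torch.cuda.max_memory_allocated()      # peak bytes actually used by tensors
reserved = torch.cuda.max_memory_reserved()    # peak bytes held by the caching allocator
print(f"allocated {alloc / 2**20:.0f} MiB, reserved {reserved / 2**20:.0f} MiB, "
      f"cached but unused {(reserved - alloc) / 2**20:.0f} MiB")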
Yes, if the reserved memory is never used, then it wouldn't be an issue. I agree with you that softmax attention may be the real cause. I will let you know if I have new findings on this issue.
Adding torch.cuda.empty_cache() before processor.step() can resolve this issue. Now I get:
torch.cuda.max_memory_allocated() # 1094M
torch.cuda.max_memory_reserved() # 1530M
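For concreteness, the call sits at the top of the per-frame loop; the sketch below uses a dummy stand-in for the processor, since only torch.cuda.empty_cache() and the processor.step() call pattern come from the actual setup:

import torch

class DummyProcessor:                     # stand-in for the real VOS inference object
    def step(self, frame):
        return frame.mean()

processor = DummyProcessor()
frames = [torch.randn(3, 480, 853, device="cuda") for _ in range(4)]   # toy frames

with torch.inference_mode():
    for frame in frames:
        torch.cuda.empty_cache()          # return unused cached blocks to the driver before each step
        prediction = processor.step(frame)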
I guess this is because PyTorch by default will automatically cache the keys and values of a sequence in softmax attention.
# T=1700, max number of frames in a video
# H, W=480, 853 (all video frames in LVOS val are of the same spatial resolution)
# N=4, max number of objects in a video
# Cv=256, value dimension
# Ck=64, key dimension
# f32 = 4, float32 bytes
max_reserved_mem = T x H/16 x W/16 x (N x Cv + Ck) x f32 bytes ~= 10371M
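Evaluating that estimate as a short script (my own sketch; the rounding of H/16 and W/16 is a guess, which is probably why it lands slightly above the ~10371M quoted):

T, H, W, N, Cv, Ck, F32 = 1700, 480, 853, 4, 256, 64, 4

tokens_per_frame = (H // 16) * (W // 16)               # spatial tokens at stride 16
bytes_total = T * tokens_per_frame * (N * Cv + Ck) * F32
print(f"~{bytes_total / 2**20:.0f} MiB")               # on the order of 10 GB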
Caching these keys and values is efficient for parallel training/inference with softmax self-attention. However, the softmax matching in memory-based VOS is cross-attention, not self-attention; in other words, we don't need to cache these keys and values for reuse. This explanation makes sense to me, though I could be wrong.
Hi Rex,
Thanks for your significant contribution to VOS! When testing your Cutie-base model on LVOS-val, I found that PyTorch needed to cache a large amount of memory, as shown below.
Do you have any insight into the cause of such a large memory cache? It must be related to the memory management, but I have no idea which operations are the exact cause. Thank you in advance for any explanation.
Best, Qin