Open LucQueen opened 7 months ago
Hi @LucQueen - which GPU type are you running on? Thank you.
@cpuhrsch thanks for your reply. The GPU type is A100 80G SXM.
@LucQueen - oh ok! That's probably because you're using the fused kernel. add_decomposed_rel_pos has been shortened and no longer materializes the full attention mask. Instead we're using flash_4
to fuse the construction, which is a lot more memory efficient. The reason you're seeing a bigger memory footprint is probably that you're using a larger batch size than the one in that snapshot. That snapshot is from the unmodified segment-anything.
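To see why materializing the full relative-position attention bias is so expensive, here is a back-of-envelope calculation. The numbers are illustrative assumptions only (4096 tokens for a 64×64 global-attention block and 16 heads, fp16), not values taken from the repo:

```python
# Rough size of a materialized (batch * heads, tokens, tokens) attention
# bias tensor, as built by the unfused add_decomposed_rel_pos path.
# Assumed illustrative config: N = 64*64 tokens, 16 heads, fp16.

def attn_bias_bytes(batch, heads, tokens, bytes_per_el=2):
    """Bytes needed for a (batch * heads, tokens, tokens) tensor."""
    return batch * heads * tokens * tokens * bytes_per_el

gib = 1024 ** 3
for batch in (8, 16):
    size = attn_bias_bytes(batch, heads=16, tokens=64 * 64)
    print(f"batch {batch}: {size / gib:.1f} GiB")  # 4.0 GiB and 8.0 GiB
```

Under these assumptions a single such tensor already costs 4 GiB at batch size 8 and 8 GiB at batch size 16, which is why fusing its construction into the kernel saves so much memory.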
@cpuhrsch thanks for your reply! I am using batch size 16, whereas the doc uses batch size 8, so the bigger memory footprint makes sense. But I'm still very confused: the memory snapshot in the doc also appears to use the fused kernel (you can see it marked in the red box in the picture), so why can't I get add_decomposed_rel_pos stack information in my memory snapshot?
@LucQueen - Ah! Hm, I'm not sure. Is your picture from the latest version of segment-anything-fast?
The picture you reference is from a section within the blog and is not based on the most recent version of segment-anything-fast. It was recorded from an earlier version without the fused kernels.
Hi, how can I reproduce the memory snapshot shown in the doc?
What I get is:
I'm very confused about why I cannot get add_decomposed_rel_pos stack information in the memory snapshot, and about how to get the full stack backtrace. The torch version is 2.2, following the instructions in https://github.com/pytorch-labs/segment-anything-fast/tree/main/experiments#installation-instructions Looking forward to a reply.
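For reference, a memory snapshot with Python stack traces can be recorded with PyTorch's allocator history API (available since roughly 2.1; note these are underscore-prefixed, semi-private functions that may change). The `run_workload` callable below is a placeholder for your own forward pass, not something from the repo:

```python
# Minimal sketch: record a CUDA memory snapshot with stack traces so
# that frames such as add_decomposed_rel_pos appear in the viewer.
# Assumes PyTorch >= 2.1; the APIs used are semi-private (_-prefixed).
import torch

def record_snapshot(run_workload, out_path="snapshot.pickle"):
    if not torch.cuda.is_available():
        return None  # nothing to record without a CUDA device
    # Start recording allocator events, keeping Python stack traces.
    torch.cuda.memory._record_memory_history(max_entries=100000)
    run_workload()  # placeholder: run your model's forward pass here
    # Dump the snapshot; load the pickle at https://pytorch.org/memory_viz
    torch.cuda.memory._dump_snapshot(out_path)
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
    return out_path
```

If recording was started only after the model ran (or not at all), allocations show up without stack information, which would explain a snapshot missing the add_decomposed_rel_pos frames.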