pytorch-labs / segment-anything-fast

A batched offline inference oriented version of segment-anything
Apache License 2.0
how to reproduce memory snapshot in doc? #112

Open LucQueen opened 7 months ago

LucQueen commented 7 months ago

Hi, how can I reproduce the memory snapshot shown in the doc? [image]

What I get is: [image]

I'm confused about why I cannot get the 'add_decomposed_rel_pos' stack information in my memory snapshot, and about how to get the full stack backtrace. The torch version is 2.2, installed following the instructions at https://github.com/pytorch-labs/segment-anything-fast/tree/main/experiments#installation-instructions Looking forward to a reply.
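For readers who want to record a snapshot of their own, here is a minimal sketch using the allocation history recorder in torch.cuda.memory. It is not necessarily how the snapshot in the doc was produced. The helper name record_cuda_memory_snapshot, the workload callable, and the default file name are hypothetical; the underscore-prefixed torch calls are private PyTorch APIs, so their behavior may differ between versions.

```python
import importlib.util


def record_cuda_memory_snapshot(workload, path="snapshot.pickle", max_entries=100_000):
    """Run workload() while recording CUDA allocations with stack traces.

    Returns the snapshot path, or None when torch or CUDA is unavailable.
    The helper name and its arguments are hypothetical, for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None

    # Start recording allocation and free events together with their stack traces.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        workload()
        torch.cuda.synchronize()
        # Write the recorded history to a pickle file.
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Stop recording even if the workload raised an exception.
        torch.cuda.memory._record_memory_history(enabled=None)
    return path


if __name__ == "__main__":
    def workload():
        # Placeholder workload; replace with the model's forward pass.
        import torch
        x = torch.randn(1024, 1024, device="cuda")
        (x @ x).sum().item()

    print(record_cuda_memory_snapshot(workload))
```

The resulting pickle file can be loaded into PyTorch's memory_viz page to browse allocations and their stacks. A function only appears in a stack if it was on the call stack when an allocation happened, so a function that no longer allocates memory will not show up.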

cpuhrsch commented 7 months ago

Hi @LucQueen - which GPU type are you running on? Thank you.

LucQueen commented 7 months ago

@cpuhrsch thanks for your reply. The GPU type is A100 80G SXM.

cpuhrsch commented 7 months ago

@LucQueen - oh ok! That's probably because you're using the fused kernel. add_decomposed_rel_pos has been shortened and no longer materializes the full attention mask. Instead, we're using flash_4 to fuse the construction, which is a lot more memory efficient. The reason you're seeing a bigger memory footprint is probably that you're using a larger batch size than the one used for that snapshot. That snapshot is from the unmodified segment-anything.

See https://github.com/pytorch-labs/segment-anything-fast/blob/387488bc4c7ab2ae311fb0632b34cab5cbfbab78/segment_anything_fast/modeling/image_encoder.py#L233-L247
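As a rough illustration of why a materialized attention mask is costly and why it scales with batch size, here is a back-of-the-envelope sketch. The shape values are assumptions chosen for illustration (16 heads, a 64x64 token grid, 2 bytes per element) and are not taken from this thread; attention_bias_bytes is a hypothetical helper.

```python
def attention_bias_bytes(batch, heads, tokens, bytes_per_element):
    """Size in bytes of one dense (batch, heads, tokens, tokens) attention bias."""
    return batch * heads * tokens * tokens * bytes_per_element


HEADS = 16        # assumed number of attention heads
TOKENS = 64 * 64  # assumed token count: 1024x1024 image with 16x16 patches
BYTES = 2         # assumed half precision (fp16 or bf16)

for batch in (8, 16):
    gib = attention_bias_bytes(batch, HEADS, TOKENS, BYTES) / 2**30
    print(f"batch {batch}: {gib:.1f} GiB per materialized bias")
```

Under these assumptions, a single dense bias takes 4.0 GiB at batch size 8 and 8.0 GiB at batch size 16. The cost grows linearly with batch size, and a fused kernel that builds the bias on the fly avoids holding this tensor in memory at all.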

LucQueen commented 7 months ago

@cpuhrsch thanks for your reply! I am using batch size 16, whereas the doc uses batch size 8, which explains the bigger memory footprint. But I'm still confused: the memory snapshot in the doc also appears to use the fused kernel (see the part marked with a red box in the picture), so why can't I get the 'add_decomposed_rel_pos' stack information in my memory snapshot? [image]

cpuhrsch commented 7 months ago

@LucQueen - Ah! Hm, I'm not sure. Is your picture from the latest version of segment-anything-fast?

The picture you reference is from a section of the blog and is not based on the most recent version of segment-anything-fast. It was recorded from an earlier version without the fused kernels.