Prov-GigaPath: A whole-slide foundation model for digital pathology from real-world data

Large coordinates cause CUDA error? #9

Closed yihu-dev closed 4 months ago

yihu-dev commented 4 months ago

Hi,

Great work!

When I was extracting the slide-level embedding, it worked well with small coords, but a CUDA error is triggered when the coord values are large, say, 28000.

     40 out = rearrange(out, 'b l (r h) d -> b l h d r', r=ratio)
---> 41 out = torch.diag_embed(out, offset=0, dim1=4, dim2=5)
     42 out = rearrange(out, 'b l h d r1 r2 -> b (r2 h) (l r1) d', r1=ratio, r2=ratio)
     44 lse = rearrange(lse, 'b (r h) l -> b l h r', r=ratio)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The features were extracted at 10x magnification for a quick test; the feature shape is [1, 5308, 1536] and the coords shape is [1, 5308, 2], both on CUDA in float16. The coords are the x, y patch locations as mentioned in issue #2.

For coordinates, it's basically X-Y, for example, (256, 256), (256, 512), (256, 768), ...
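
To make this concrete, here is an illustrative sketch (not my exact preprocessing code) of a coords tensor with the same 256-pixel grid spacing, shape, and dtype:

import torch

# (x, y) patch locations on a 256-pixel grid, with values reaching ~28000
xs = torch.arange(0, 28000, 256)
ys = torch.arange(0, 28000, 256)
grid = torch.cartesian_prod(xs, ys).float()      # all (x, y) pairs on the grid
coords = grid[:5308].unsqueeze(0).half().cuda()  # -> [1, 5308, 2], cuda float16
tile_embed = torch.randn(1, 5308, 1536, dtype=torch.float16, device="cuda")
print(coords.shape)                              # torch.Size([1, 5308, 2])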

I use a single A100 GPU with 40 GB of memory. I'm wondering if the large coordinates cause the CUDA error.

Thanks

afiliot commented 4 months ago

Hey @yihu-dev, how much time did it take for you to run inference on 5k tiles with your A100? Thanks

yihu-dev commented 4 months ago

Hi, it takes less than 4 minutes to do the tile encoding and around 0.1 s to extract the slide-level feature.

PabloMeseguerEsbri commented 4 months ago

I get the same error when using CUDA and torch.float16 (which looks to be required to support FlashAttention), at this line:

File "/workspace/code/gigapath/slide_encoder.py", line 204, in forward
    x = torch.cat((cls_tokens, x), dim=1)

But to me it looks to be independent of the number of patches, as it passes for [1, 707, 1536] but not for [1, 491, 1536].
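
For anyone hitting this, a minimal debugging sketch (following the suggestion in the error message itself, and mirroring the explicit float16 casting described in this thread) that makes the assert surface at the actual failing op:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before torch initializes CUDA, so kernel errors are reported synchronously

import torch
import gigapath.slide_encoder as slide_encoder

model = slide_encoder.create_model("hf_hub:prov-gigapath/prov-gigapath", "gigapath_slide_enc12l768d", 1536)
model = model.cuda().half().eval()  # everything cast to float16 up front, as described above

tile_embed = torch.randn(1, 491, 1536, device="cuda", dtype=torch.float16)
coords = 256.0 * torch.randint(0, 110, (1, 491, 2), device="cuda").half()  # grid coords up to ~28000

with torch.no_grad():
    out = model(tile_embed, coords)  # with blocking launches, the stack trace points at the real op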

HanwenXuTHU commented 4 months ago

Hi, thank you for your interest in our work!

I just tried the following and it works well for me.

import torch
import gigapath.slide_encoder as slide_encoder

# load from HuggingFace
model = slide_encoder.create_model("hf_hub:prov-gigapath/prov-gigapath", "gigapath_slide_enc12l768d", 1536)
print("param #", sum(p.numel() for p in model.parameters()))

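# random tile embeddings and uniformly large (28000) coordinates, matching the shapes reported above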
tile_embed = torch.randn(1, 5308, 1536).cuda()
coords = 28000 * torch.ones(1, 5308, 2).cuda()
model = model.cuda()

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=torch.float16):
        output = model(tile_embed, coords)
print(output)

Did you wrap inference in with torch.cuda.amp.autocast(dtype=torch.float16): when doing inference? It would be nice if you could share your inputs and code. Looking forward to hearing from you!

PabloMeseguerEsbri commented 4 months ago

@HanwenXuTHU, thanks for the feedback. with torch.cuda.amp.autocast(dtype=torch.float16): worked for me. On a different note, what motivated having different latent dimensions for the patch-level (1536) and slide-level (768) features?

HanwenXuTHU commented 4 months ago

Good question! The 1536 dimension is a standard ViT-Giant setting while 768 is a ViT-Base setting. We chose ViT-G -> ViT-B as a first design choice.

yihu-dev commented 4 months ago

@HanwenXuTHU, it worked, thank you! I had not used that line; I had just cast both the model and the data to float16.
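
In case it helps others, a rough sketch of the difference (my own summary, not an official recommendation from the repo):

import torch
import gigapath.slide_encoder as slide_encoder

model = slide_encoder.create_model("hf_hub:prov-gigapath/prov-gigapath", "gigapath_slide_enc12l768d", 1536)
tile_embed = torch.randn(1, 5308, 1536).cuda()
coords = 28000 * torch.ones(1, 5308, 2).cuda()

# what I had (fails on my setup): model and data cast to float16 up front
# model = model.cuda().half()
# output = model(tile_embed.half(), coords.half())

# what works: keep fp32 weights and inputs, let autocast pick float16 per op (e.g. for FlashAttention)
model = model.cuda()
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=torch.float16):
        output = model(tile_embed, coords)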