yanboliang opened 3 months ago
Seeing similar issues with AMD GPUs as well. With AMD GPUs, we see a memory fault rather than device assertions. It looks like the kernels generated for AMD don't have these device asserts.
```
Memory access fault by GPU node-3 (Agent handle: 0x80e6680) on address 0x7eff45229000. Reason: Unknown.
Aborted (core dumped)
```
Observations:
So the error seems to be related to some interaction between the compiled prefill and decode kernels.
It looks like prefill compile can work if I change `next_token.view(1, -1)` to `next_token.clone().view(1, -1)` here.
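The difference between the two expressions can be shown on CPU tensors as well: `view` aliases the original storage, so if `next_token` is later overwritten in place (as CUDA graph replay does with its static buffers), the viewed tensor silently changes too, while `clone().view()` takes a private copy. A minimal sketch (the `fill_` here only simulates an in-place overwrite; it stands in for what graph replay would do):

```python
import torch

next_token = torch.tensor([42])

aliased = next_token.view(1, -1)           # shares storage with next_token
detached = next_token.clone().view(1, -1)  # owns its own storage

next_token.fill_(7)  # simulate the buffer being overwritten in place

print(aliased)   # tensor([[7]])  -- sees the overwrite
print(detached)  # tensor([[42]]) -- unaffected
```

This is why inserting `.clone()` before the reshape can mask an aliasing problem between the compiled prefill and decode paths.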
Is there a resolution to this problem for `--compile` only? I am still getting it with:
```
pytorch-triton==3.0.0+45fff310c8
torch==2.4.0.dev20240527+cu121
torchaudio==2.2.0.dev20240528+cu121
torchvision==0.19.0.dev20240528+cu121
```
@griff4692 Does https://github.com/pytorch-labs/gpt-fast/issues/137#issuecomment-2025959457 work?
Nope, unfortunately -- it looks like in the current code `next_token` is already cloned anyway:
https://github.com/pytorch-labs/gpt-fast/issues/137#issuecomment-2025959457
@griff4692 It seems you hit a different issue from this one; I tried your command and it works well on gpt-fast. So I suspect some change on the context-compression side triggered a cudagraphs error. I'm looking into what triggers it now.
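For context on why cudagraphs errors tend to surface this way: under CUDA graph replay (e.g. `torch.compile(mode="reduce-overhead")`), outputs are written into static buffers that are reused across replays, so a reference kept from an earlier step is silently overwritten unless it was cloned. A hedged, CPU-runnable sketch of that reuse pattern (the `replay` helper is hypothetical and only models the static-buffer behavior, not the real cudagraphs machinery):

```python
import torch

# Static buffer standing in for a CUDA-graph-managed output tensor.
static_out = torch.empty(1)

def replay(x):
    # Each "replay" overwrites the same static buffer and hands it back.
    static_out.copy_(x)
    return static_out

first = replay(torch.tensor([1.0]))
kept = first.clone()                  # snapshot before the next replay
second = replay(torch.tensor([2.0]))

print(first)   # tensor([2.]) -- aliased to the static buffer
print(kept)    # tensor([1.]) -- the clone preserved the earlier value
```

When downstream code mutates or holds such an aliased tensor across decode steps, the result is exactly the kind of cross-kernel corruption described above.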
Repro command:
Errors:
generated kernel file: https://gist.github.com/yanboliang/6f5c1171e63909b995b5372dc7c88ab7