0000-add-support-for-conversion-fp16-to-fp32.patch limit to compute 6.1

cduk commented 1 month ago

In your 0000-add-support-for-conversion-fp16-to-fp32.patch you convert fp16 to fp32. Is there a way to limit this to compute capability 6.1 so that the P100 (compute capability 6.0) keeps FP16?

The reason is that FP16 is twice as fast as FP32 on the P100.
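
For illustration, the kind of gate being asked about could look like the minimal sketch below. It is not taken from the patch or from Triton: `UPCAST_CAPABILITIES` and `should_upcast_fp16` are hypothetical names, and treating only compute capability 6.1 as needing the upcast is the assumption from the question, not a confirmed requirement.

```python
# Hypothetical sketch: these names do not come from the patch or from Triton.
from typing import Tuple

# Compute capabilities assumed (for illustration only) to need the
# FP16 -> FP32 upcast. (6, 1) is the P40; (6, 0) is the P100.
UPCAST_CAPABILITIES = {(6, 1)}


def should_upcast_fp16(capability: Tuple[int, int]) -> bool:
    """Return True if FP16 values should be converted to FP32 on this device."""
    return capability in UPCAST_CAPABILITIES


# With PyTorch installed, the (major, minor) tuple could be obtained from
# torch.cuda.get_device_capability(device_index).
for name, cap in [("P100", (6, 0)), ("P40", (6, 1))]:
    decision = "upcast to FP32" if should_upcast_fp16(cap) else "keep FP16"
    print(f"{name} {cap}: {decision}")
```

Keeping the capabilities in a set means another architecture could be added without touching the check itself.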

the-crypt-keeper commented 1 month ago

@cduk Trust me, you want this conversion: it prevents a crash with --enable-prefix-caching.

See vllm #4438

Edit: I've built Triton and vLLM from this repo and can confirm that the crash above is fixed. At least with a 70B GPTQ model on my 2x3060+2xP100 setup, I don't see any difference in performance (approx. 15.5 tok/sec).

sasha0552 commented 1 month ago

Hi @cduk.

> In your 0000-add-support-for-conversion-fp16-to-fp32.patch you convert fp16 to fp32. Is there a way to limit this to compute 6.1 so that P100 (compute 6.0) keeps FP16?

Have you confirmed the performance degradation? Do you get a crash with the original Triton (the "Cannot convert f16 to f16" error, not "LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32")? There is an attempt to upcast the dot results to FP32 on the P40, but I'm not sure whether this upcast happens on the P100, since it has good FP16 performance.

If you get the "Cannot convert f16 to f16" error with the original Triton, then the conversion to FP32 is necessary anyway. If the error does not appear, then no conversion is required and none takes place (conversion only occurs when necessary), so you can safely use Triton with this patch.

(There are two patches for Triton: one fixes "Cannot convert f16 to f16", the other fixes "LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32".)

sasha0552 / vllm-ci

0000-add-support-for-conversion-fp16-to-fp32.patch limit to compute 6.1 #2