qwopqwop200 / GPTQ-for-LLaMa

4-bit quantization of LLaMA using GPTQ
Apache License 2.0

Wondering whether some of the Triton or CUDA kernels also speed up FP16? #253

Open drxmy opened 1 year ago

drxmy commented 1 year ago

I am not familiar with Triton or CUDA, but it feels like some of this code (e.g., fused_attn) could also be applied to FP16 models to gain inference speedup compared with the Hugging Face implementation?
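
For anyone wanting to check this empirically, here is a minimal standalone sketch. Note the assumptions: it uses PyTorch's built-in fused `scaled_dot_product_attention` kernel (PyTorch >= 2.0) rather than this repo's `fused_attn` module, which as I understand it targets the quantized path, and it requires a CUDA GPU. It compares an unfused FP16 attention (roughly what an eager Hugging Face attention block computes) against a fused kernel:

```python
# Standalone illustration, NOT the repo's fused_attn module: compares a
# naive FP16 attention against PyTorch's fused scaled_dot_product_attention
# to check empirically whether a fused kernel helps at FP16.
# Assumes PyTorch >= 2.0 and a CUDA GPU.
import time

import torch
import torch.nn.functional as F


def naive_attention(q, k, v):
    # Unfused reference: explicit matmul -> softmax -> matmul.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v


def bench(fn, *args, iters=50):
    # Warm up, then time with CUDA synchronization for accurate numbers.
    for _ in range(5):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # Shapes chosen to resemble one LLaMA-7B attention layer (illustrative only).
    batch, heads, seq, dim = 1, 32, 2048, 128
    q, k, v = (
        torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
        for _ in range(3)
    )
    t_naive = bench(naive_attention, q, k, v)
    t_fused = bench(F.scaled_dot_product_attention, q, k, v)
    print(f"naive: {t_naive * 1e3:.2f} ms  fused: {t_fused * 1e3:.2f} ms")
```

If the fused kernel wins here, that would suggest fused attention helps FP16 inference independently of quantization, which is what the question is getting at.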