Chillee opened 7 months ago
In this case I'm guessing that for fp8 you might not need a scale parameter for the weights, since each weight has its own scaling factor.
I haven't done any evals, but this is just an example of weight-only fp8 support if folks want to play with it :P
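As a rough illustration of the scale parameter in question (this is not the PR's code, just a hypothetical pure-Python sketch): int8 weight-only quantization typically stores one scale per output channel, whereas each fp8 value already carries its own exponent, which is why the separate scale might be droppable.

```python
# Hypothetical sketch of row-wise (per-output-channel) int8 weight
# quantization -- the scheme where a separate scale parameter is needed.

def quantize_rowwise_int8(weight):
    """Quantize each row of a float matrix to int8 plus one float scale."""
    q_rows, scales = [], []
    for row in weight:
        # One scale per row, chosen so the largest value maps to +/-127.
        scale = max(abs(v) for v in row) / 127.0 or 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rowwise_int8(q_rows, scales):
    """Recover approximate float weights from int8 values and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
q, s = quantize_rowwise_int8(w)
w_hat = dequantize_rowwise_int8(q, s)
```

With fp8 weights, the dequantize step would instead just be a cast, since the per-value exponent plays the role the per-row scale plays here.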
Perf is at 102.9 tok/s for fp8 vs. 103.8 tok/s for int8 quantization.
Could we keep both int8 and fp8? Why replace one with the other, especially given the perf regression (subtle, but still)?
It's just an example PR - I'm not intending to merge it.