pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

[example] changed int8 quantization to do fp8 weight-only quantization #33

Open Chillee opened 7 months ago

Chillee commented 7 months ago

In this case I'm guessing that for fp8 you might not need a separate scale parameter for the weights, since each fp8 value already carries its own exponent and thus effectively its own scaling factor.

I haven't done any evals, but this is just an example of weight-only fp8 support if folks want to play with it :P

Perf is at 102.9 tok/s for fp8 vs. 103.8 tok/s for int8 quantization.
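For anyone who wants to play with the idea without reading the diff, here is a minimal sketch of what weight-only fp8 quantization can look like in plain PyTorch. This is not the PR's actual code: the `WeightOnlyFP8Linear` class, the `quantize_fp8_weight_only` helper, and the choice to keep a per-channel scale are all assumptions for illustration, and it requires a PyTorch build that exposes `torch.float8_e4m3fn`.

```python
# Hypothetical sketch of weight-only fp8 quantization (not the PR's code).
# Weights are stored as fp8 (e4m3) and upcast to bf16 at matmul time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightOnlyFP8Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        # fp8 storage for the weight; activations stay in bf16.
        self.register_buffer(
            "weight",
            torch.empty(out_features, in_features, dtype=torch.float8_e4m3fn),
        )
        # Optional per-output-channel scale. With fp8 each element already has
        # its own exponent, so this may be unnecessary (per the comment above).
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize the weight on the fly and run a regular bf16 matmul.
        w = self.weight.to(torch.bfloat16) * self.scales.unsqueeze(-1)
        return F.linear(x, w)


def quantize_fp8_weight_only(linear: nn.Linear) -> WeightOnlyFP8Linear:
    """Convert an nn.Linear weight to fp8 (e4m3) with a per-channel scale."""
    q = WeightOnlyFP8Linear(linear.in_features, linear.out_features)
    w = linear.weight.detach().to(torch.bfloat16)
    # Scale each output channel so its max magnitude fits e4m3's range (~448).
    scales = w.abs().amax(dim=1).clamp(min=1e-8) / 448.0
    q.scales.copy_(scales)
    q.weight.copy_((w / scales.unsqueeze(-1)).to(torch.float8_e4m3fn))
    return q
```

Since this is weight-only, only the storage dtype and the dequant cast differ from the int8 path; the matmul itself still runs in bf16, which is consistent with the small perf difference reported above.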

Artyom17 commented 4 months ago

Can we keep both int8 and fp8? Why replace one with the other, especially given the perf degradation (subtle, but still)?

Chillee commented 4 months ago

It's just an example PR; I'm not intending to merge it.