tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

Systematic comparison of original models to int8 inferencing #9

Open innokean opened 1 year ago

innokean commented 1 year ago

I'm curious if anyone has done a systematic comparison of original LLaMA inference and int8 inference.

@tloen great work! Much appreciated. I was able to run your 13B example on a 16GB GPU (it was tight, though).

tloen commented 1 year ago

I'll leave the task to someone who's actually able to run 13B+ unquantized! In my experiments with 7B, though, I'm getting very low KL divergences (on the order of 10^-3) from the outputs of the original model.
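For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of computing per-token KL divergence between the two models' output distributions. It assumes two hypothetical model objects (`model_fp16`, `model_int8`) that map a batch of token IDs to logits; the names and calling convention are illustrative, not the repo's actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_kl_divergence(model_fp16, model_int8, tokens: torch.Tensor) -> float:
    """Average per-token KL(P_fp16 || P_int8) over a batch of sequences.

    `tokens` is a (batch, seq_len) tensor of token IDs; both models are
    assumed to return (batch, seq_len, vocab) logits.
    """
    logits_ref = model_fp16(tokens)  # reference (unquantized) logits
    logits_q = model_int8(tokens)    # quantized logits, same shape

    # Cast to float32 before softmax so the comparison itself doesn't
    # pick up extra noise from fp16 rounding.
    log_p = F.log_softmax(logits_ref.float(), dim=-1)
    log_q = F.log_softmax(logits_q.float(), dim=-1)

    # KL(P || Q) = sum_i p_i * (log p_i - log q_i), per token,
    # then averaged over every position in the batch.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    return kl.mean().item()
```

A mean value on the order of 10^-3, as reported above for 7B, would indicate the quantized model's next-token distributions stay very close to the original's.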