Open innokean opened 1 year ago
I'm curious whether anyone has done a systematic comparison of original LLaMA inference and int8 inference.
@tloen great work, much appreciated! I was able to run your 13B example on a 16 GB GPU (it was tight, though).
I'll leave the task to someone who's actually able to run 13B+ unquantized! In my experiments with 7B, though, I'm getting very low KL divergences (on the order of 10^-3) from the outputs of the original model.
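For anyone wanting to reproduce this kind of check, here is a minimal sketch of how a per-position KL divergence between the fp16 and int8 logits could be computed. The `mean_kl` helper is hypothetical, and the random tensors below stand in for actual model outputs (shape `(positions, vocab)`), since running two copies of LLaMA is out of scope here:

```python
import torch
import torch.nn.functional as F

def mean_kl(logits_ref: torch.Tensor, logits_quant: torch.Tensor) -> torch.Tensor:
    """KL(P_ref || P_quant) averaged over positions, with P = softmax over the vocab."""
    log_p = F.log_softmax(logits_ref, dim=-1)   # reference (e.g. fp16) model
    log_q = F.log_softmax(logits_quant, dim=-1) # quantized (e.g. int8) model
    # kl_div expects input = log Q, target = log P (with log_target=True)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

# Toy example: a small perturbation mimicking quantization error
torch.manual_seed(0)
ref = torch.randn(4, 32000)                 # stand-in for fp16 logits
quant = ref + 0.01 * torch.randn_like(ref)  # stand-in for int8 logits
print(mean_kl(ref, quant).item())
```

In practice you would run both models over the same prompts and feed their logits in; divergences on the order of 10^-3, as reported above, indicate the quantized distribution tracks the original closely.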