nikshepsvn opened this issue 1 year ago
The fast inference comes from the 4-bit GPTQ models. Also, the Triton code compiles during the first generation, which is why the real speedup only shows up from the second generation onwards.
If you prefer a constant speed with 4-bit GPTQ models, you have to use the CUDA backend instead, but it is a bit slower.
The CUDA kernels also have to be installed separately.
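The warm-up effect described above can be illustrated with a minimal sketch. This is not the actual Triton/GPTQ code path; it just simulates a JIT compile that is paid once per kernel and cached afterwards, so the first "generation" is slow and later ones are fast:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_kernel(shape):
    # Stand-in for Triton's JIT compilation: expensive the first time
    # a kernel is needed for a given shape, served from cache afterwards.
    time.sleep(0.2)  # simulated compile cost
    return f"kernel<{shape}>"

def generate(shape=(1, 2048)):
    compile_kernel(shape)  # compiled once, then a cache hit
    # ... the actual quantized matmuls and sampling would run here ...

t0 = time.perf_counter(); generate(); first = time.perf_counter() - t0
t0 = time.perf_counter(); generate(); second = time.perf_counter() - t0
print(f"first: {first:.3f}s  second: {second:.3f}s")
```

So when benchmarking, time the second generation onwards; the first call mostly measures compilation, not inference.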
I'm running the 7B model on a 3090, and the inference time for the prompt "How to prepare pasta?" is around 20-30s. Is this expected?