rmihaylov / falcontune

Tune any FALCON in 4-bit
Apache License 2.0
468 stars 52 forks

Inference speed for 7B models (triton backend, RTX 3090) #3

Open nikshepsvn opened 1 year ago

nikshepsvn commented 1 year ago

I'm running the 7B model on a 3090, and the inference time for the prompt "How to prepare pasta?" is around 20-30s. Is this expected?


rmihaylov commented 1 year ago

The fast inference applies to the 4-bit GPTQ models. Also, the triton code compiles during the first generation, which is why the real speedup shows up from the second generation onwards.
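A minimal sketch of how to see this warmup effect yourself: time each generation separately, so the one-time compile cost in the first call is visible apart from the steady-state speed. `fake_generate` below is a stand-in for a real `model.generate` call, not falcontune's actual API.

```python
import time

def time_generations(generate, prompt, runs=3):
    """Time each call individually; the first run typically includes
    one-time costs such as triton kernel compilation."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in for a real falcontune generation call (hypothetical).
def fake_generate(prompt):
    time.sleep(0.01)

timings = time_generations(fake_generate, "How to prepare pasta?")
# Compare timings[0] (includes any compile cost) against the average
# of the remaining runs to estimate steady-state generation speed.
```

With the real model, you would expect `timings[0]` to be much larger than the later entries when using the triton backend.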

rmihaylov commented 1 year ago

If one prefers a constant speed when using the 4-bit GPTQ models, then the cuda backend has to be used, but it is a bit slower.

rmihaylov commented 1 year ago

In that case, the cuda kernels have to be installed as well.
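A rough sketch of what switching to the cuda backend might look like. The exact flag and file names below are assumptions, not confirmed falcontune options; check the repo's README and `falcontune generate --help` for the real invocation.

```shell
# Build and install the cuda kernels from the repo root
# (command assumed from common GPTQ setups; see the falcontune README).
python setup_cuda.py install

# Generate with the cuda backend instead of triton
# (flag and weight-file names are assumptions).
falcontune generate \
    --model falcon-7b-instruct-4bit \
    --weights gptq_model-4bit-64g.safetensors \
    --backend cuda \
    --prompt "How to prepare pasta?"
```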