GPTQ models load much slower than 0cc4m's fork used to

henk717 / KoboldAI

KoboldAI is generative AI software optimized for fictional use, but capable of much more!

http://koboldai.com

GNU Affero General Public License v3.0


Open fkiifdjo opened 1 year ago

fkiifdjo commented 1 year ago

The slow loading happens every time a new chat message is generated. It's particularly noticeable with 30B models, but it's also noticeable with 13B.

henk717 commented 1 year ago

There are currently two issues going on:

1. Occam's (0cc4m's) GPTQ version is not properly compatible with newer Hugging Face builds, so you are falling back to AutoGPTQ in more cases.
2. AutoGPTQ itself has an outstanding issue where this slowdown happens on some systems, which its developers still need to address.

So on the Kobold side, we are waiting for either GPTQ package to receive the needed updates before full speed can be restored.