Closed manojpreveen closed 1 year ago
Hi @manojpreveen! 20B is indeed a relatively large model, but generation should not be this slow.
You can try adding `MODEL_HALF_PRECISION=true` to your environment variables to enable half-precision inference, which reduces memory usage while improving generation speed.
In particular, if you are using the 40GB version of the A100, the 20B model may exceed the memory limit in full precision, forcing it to start swapping, which slows generation considerably.
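To see why the 40GB card is borderline, here is a back-of-envelope estimate of the memory needed for the weights alone (standard byte sizes per parameter; actual usage is higher once activations, the KV cache, and framework overhead are included):

```python
# Rough weight-memory estimate for a 20B-parameter model.
# This counts parameters only, not activations or runtime overhead.
params = 20e9  # ~20 billion parameters

fp32_gb = params * 4 / 1e9  # full precision: 4 bytes per parameter
fp16_gb = params * 2 / 1e9  # half precision: 2 bytes per parameter

print(f"fp32 weights: ~{fp32_gb:.0f} GB")  # ~80 GB, well past a 40GB A100
print(f"fp16 weights: ~{fp16_gb:.0f} GB")  # ~40 GB, right at the limit
```

So full precision cannot fit on a 40GB A100 at all, and even half precision roughly fills the card, which is why INT8 can be worth trying when half precision alone is not enough.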
Similarly, you can use `MODEL_LOAD_IN_8BIT=true` to enable INT8 quantization. However, not every model is compatible with this option, so it's still recommended to try the half-precision option above first.
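A minimal launch sketch putting the two options together. The variable names are the ones from this thread and the model ID is the one reported in the issue; the server entrypoint is an assumption, so substitute however you normally start the server:

```shell
# Configure the model and enable half precision via environment variables.
export MODEL=togethercomputer/GPT-NeoXT-Chat-Base-20B
export MODEL_HALF_PRECISION=true

# Alternative if half precision is not enough (not all models support INT8):
# export MODEL_LOAD_IN_8BIT=true

# Then start the server as usual, e.g.:
# python -m <your_server_module>
echo "MODEL_HALF_PRECISION=$MODEL_HALF_PRECISION"
```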
Yeah, enabling half precision definitely made a difference and it's much faster now. Thanks. Closing the issue.
Thanks for this package, it works great and is pretty fast when I tried it with the Bloomz 7B model. But when I tried the same with [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B), the streaming token generation is very slow (~1 token every 2-3 secs).
Just checking whether this is expected or if I'm missing something, since I can see you have tested this model too, per the README.
I'm running it on a single A100 machine, and during streaming token generation the GPU utilization is around ~55%.