hyperonym / basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
MIT License

Slow Streaming #99

Closed manojpreveen closed 1 year ago

manojpreveen commented 1 year ago

Thanks for this package, it works great and is pretty fast when I tried it with the Bloomz 7B model. But when I tried the same with [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B), streaming token generation is very slow (~1 token every 2-3 seconds).

Just checking whether this is expected or if I'm missing something, as I can see you have tested this model too, per the README.

I'm running it on a single A100 machine, and during streaming token generation the GPU utilization is around ~55%.

peakji commented 1 year ago

Hi @manojpreveen ! 20B is indeed a relatively large model, but the speed should not be this slow.

You can try adding `MODEL_HALF_PRECISION=true` to your environment variables to enable half precision, which reduces memory usage while improving generation speed.

Especially if you are using the 40GB version of the A100: a 20B model exceeds the memory limit in full precision, forcing it to start swapping, which slows down generation considerably.
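As a rough back-of-the-envelope check (weights only, ignoring activations and the KV cache, so actual usage is higher):

```python
# Rough weight-memory estimate for a 20B-parameter model at different precisions.
# Ignores activations, KV cache, and framework overhead, so real usage is higher.
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

n = 20e9  # GPT-NeoXT-Chat-Base-20B
print(f"FP32: {weight_memory_gib(n, 4):.0f} GiB")  # well beyond a 40GB A100
print(f"FP16: {weight_memory_gib(n, 2):.0f} GiB")  # fits, just barely
print(f"INT8: {weight_memory_gib(n, 1):.0f} GiB")
```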

Similarly, you can use `MODEL_LOAD_IN_8BIT=true` to enable INT8 quantization. However, not every model is compatible with this option, so it's still recommended to try the half-precision option above first.
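For reference, a minimal sketch of setting these flags when running the Docker image (the image name and port mapping follow the project's README; adjust the tag and flags to your setup):

```shell
# Half precision (recommended first):
docker run --gpus all -p 80:80 \
  -e MODEL=togethercomputer/GPT-NeoXT-Chat-Base-20B \
  -e MODEL_HALF_PRECISION=true \
  hyperonym/basaran:latest

# Or INT8 quantization, if the model supports it:
# -e MODEL_LOAD_IN_8BIT=true
```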

manojpreveen commented 1 year ago

Yeah, enabling half precision definitely made a difference; it's much faster now. Thanks. Closing the issue.