This repo provides demos and packages for fast inference of BLOOM. Some of the solutions have their own repos, in which case a link to the corresponding repo is provided instead.
We support HuggingFace accelerate and DeepSpeed Inference for generation.
Install required packages:
pip install flask flask_api gunicorn pydantic accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3 deepspeed-mii==0.0.2
alternatively, you can install DeepSpeed from source:
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B (fp16/bf16) and 4 A100 80GB GPUs for BLOOM 176B (int8). These scripts might not work for other models or a different number of GPUs.
DS inference is deployed using logic borrowed from the DeepSpeed MII library.
Note: Sometimes GPU memory is not freed when a DS inference deployment crashes. You can free this memory by running killall python in the terminal.
To use BLOOM quantized, pass dtype = int8. For DeepSpeed-Inference, also change model_name to microsoft/bloom-deepspeed-inference-int8; for HF accelerate, no change to model_name is needed (an example invocation is sketched after the CLI commands below).
HF accelerate uses LLM.int8() and DS-inference uses ZeroQuant for post-training quantization.
The CLI asks for generate_kwargs every time. Example:
generate_kwargs = {"min_length": 100, "max_new_tokens": 100, "do_sample": false}
using HF accelerate
python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
using DS inference
python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
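For the quantized checkpoint mentioned above, a minimal sketch of the invocation, assuming the int8 checkpoint accepts the same flags as the fp16 example (only dtype and model_name change):
python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-int8 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'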
The generation server itself is deployed via make (the exact target depends on the model being served):
make
An interactive UI can be launched via the following command to connect to the generation server. The default URL of the UI is http://127.0.0.1:5001/. The model_name is only used by the UI to check whether the model is a decoder or an encoder-decoder model.
python -m ui --model_name bigscience/bloom
This command launches a simple UI to play with generation. Sorry for the crappy design. Unfortunately, my UI skills only go so far. 😅😅😅
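Once the generation server is up, it can also be queried directly over HTTP instead of through the UI. A minimal sketch with curl, under the assumption that the server listens on port 5000 and exposes a POST /generate/ endpoint taking a JSON body with a list of input texts plus generation parameters; the exact host, port, and request schema are assumptions here, so check the inference_server module for the real ones:
curl -X POST http://127.0.0.1:5000/generate/ -H "Content-Type: application/json" -d '{"text": ["DeepSpeed is"], "max_new_tokens": 40}'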
using HF accelerate
python -m inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
using DS inference
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
alternatively, to load the model faster:
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
using DS ZeRO
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
If you run into things not working or have other questions, please open an Issue in the repository of the corresponding backend (Accelerate, DeepSpeed-Inference, or DeepSpeed-ZeRO).
If there is a specific issue with one of the scripts, and not with the backend itself, then please open an Issue here and tag @mayank31398.
Solutions developed to perform large batch inference locally include a JAX implementation, which lives in its own repo.
A solution developed to be used in a server mode (i.e. varied batch size, varied request rate) can be found here. This is implemented in Rust.