Open allanj opened 4 months ago
It would be nice if someone could tell me which version will work.
Hi @allanj, I don't think we have kernel injection support for llama-2 models. If you remove the `--use_kernel` flag, does the script work?
Additionally, what kind of GPUs are you using? You may be able to utilize DeepSpeed-MII to run the llama-2 model and get significant improvements to inference performance if you have GPUs with compute capability >=8.0:
```python
import mii

client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=8)
response = client.generate(["test prompt"], max_new_tokens=128)
```
Yes, removing the `--use_kernel` flag makes it work.
Yeah, I'm aware of DeepSpeed FastGen. I'm wondering how it handles batching, or whether I should simply loop over my prompts myself.
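For what it's worth, the "for loop" fallback amounts to chunking the prompt list and submitting one chunk at a time. A minimal sketch of that chunking (plain Python, independent of any serving API; the `client.generate`-style call it feeds is an assumption):

```python
# Sketch: split a list of prompts into fixed-size batches, so each batch
# can be passed to a generate-style call (hypothetical API) in a loop.
def batched(prompts, batch_size):
    """Yield successive batch_size-sized chunks of prompts."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]

prompts = [f"prompt {i}" for i in range(10)]
batches = list(batched(prompts, 4))
# 10 prompts with batch_size=4 -> 3 batches (sizes 4, 4, 2)
```

Whether an explicit loop is even needed depends on the serving layer; a server that does continuous batching can accept requests individually and batch them internally.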
Version
deepspeed: 0.13.4
transformers: 4.38.1
Python: 3.10
PyTorch: 2.1.2+cu121
CUDA: 12.1

Error in Example (To reproduce)
Simply run this script: https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py
It will show the following error:
Potential bug?
I suspect it did not find the right inference engine: it should be `DeepSpeedLlamaInference`, but not `DeepSpeedGPTInference`?
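To illustrate the suspected failure mode (this is a toy sketch, not DeepSpeed's actual internals; all names here are made up): if engine selection is a lookup keyed on model architecture with a GPT-style default, a missing llama entry would silently produce the wrong engine rather than an error.

```python
# Illustrative only: a registry mapping architecture -> inference engine,
# with a GPT engine as the fallback. A missing "llama" entry reproduces
# the observed mismatch (GPT engine chosen instead of the llama one).
ENGINES = {
    "gpt2": "DeepSpeedGPTInference",
    "gpt_neox": "DeepSpeedGPTInference",
    # no "llama" entry registered here
}

def pick_engine(arch, default="DeepSpeedGPTInference"):
    """Return the registered engine for arch, or the default fallback."""
    return ENGINES.get(arch, default)

print(pick_engine("llama"))  # falls back to DeepSpeedGPTInference
```

If this is roughly what happens, registering the llama container (or failing loudly on unsupported architectures when `--use_kernel` is set) would fix it.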