The `InferenceAPIClient` uses this repository as a backend, which adds multiple optimisations to make the models faster.
If you want to run the models locally, you can use the following command:
```shell
model=OpenAssistant/oasst-sft-1-pythia-12b
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus '"device=0"' --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```
You can then prompt the model with:
```python
from text_generation import Client

client = Client("http://localhost:8080")

query = "What is Deep Learning?"  # example prompt
response = client.generate(query, max_new_tokens=128, do_sample=False)
print(response.generated_text)
```
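If you want to reproduce the kind of latency comparison discussed further down, a minimal timing wrapper around the same call might look like this (the prompt is only a placeholder, and it assumes the server started above is listening on `localhost:8080`):

```python
import time

from text_generation import Client

client = Client("http://localhost:8080")
query = "What is Deep Learning?"  # placeholder prompt, not from the thread

start = time.perf_counter()
response = client.generate(query, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

print(response.generated_text)
print(f"generation took {elapsed:.2f} s")
```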
Thank you @OlivierDehaene!
> The `InferenceAPIClient` uses this repository as a backend, which adds multiple optimisations to make the models faster.
I wonder if it's possible to see or know what these optimizations are. Are they model-related (e.g. loading in 8-bit) or hardware-specific (e.g. using tensor parallelism)?
You can check out the modelling code for GPT-NeoX here.
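For illustration only (this is not the repository's code), the two categories in the question roughly map onto different loading options in plain `transformers`; a sketch, assuming `bitsandbytes` and `accelerate` are installed, could look like the following, while the server's own speedups come from its serving stack (e.g. tensor parallelism and optimized kernels) rather than from these flags alone:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "Model-related" knob: load the weights in 8-bit (requires bitsandbytes).
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", load_in_8bit=True
)

# "Hardware" knob: spread fp16 weights across all visible GPUs
# (accelerate's device_map places layers on devices; it is not true tensor parallelism).
model_sharded = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# In practice you would pick one of the two; both are loaded here only to show the flags.
```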
I found that using `InferenceAPIClient` is about 3x faster than running the same model locally. Locally I use an 8x Tesla V100 machine. With my local code I get 12.53 seconds. At the same time, using `InferenceAPIClient` gives 3.96 seconds, which is 3x faster.

Are there any tricks I can use with my local models to bring them to the same speed as via the API client? I'd like to test local models, as they've been fine-tuned on additional data.
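One option the thread doesn't spell out is to serve the fine-tuned checkpoint through the same text-generation-inference container, so the local model benefits from the same serving stack; a sketch, assuming the checkpoint has been placed inside the mounted volume (the path below is hypothetical):

```python
# Hypothetical: start the container with a local checkpoint instead of the hub id, e.g.
#   docker run ... -v $volume:/data ... --model-id /data/my-finetuned-model
# and then query it exactly as before.
from text_generation import Client

client = Client("http://localhost:8080")
response = client.generate("An example prompt", max_new_tokens=128, do_sample=False)
print(response.generated_text)
```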