huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

How to achieve the same speed as in InferenceAPIClient? #192

Closed · nd7141 closed this issue 1 year ago

nd7141 commented 1 year ago

I found that using InferenceAPIClient is about 3x faster than running the same model locally.

Locally I use a machine with 8 Tesla V100 GPUs. The local code is:

import time

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def wrap(message):
    # Wrap the prompt in the OpenAssistant chat template
    return "<|prompter|>" + message + "<|endoftext|><|assistant|>"

query = wrap('''
What families with kids love about this hotel? 
Located on prestigious Park Lane in Mayfair, the London Hilton on Park Lane Hotel features stunning views of Hyde Park, Knightsbridge and Westminster. With 453 rooms and suites, this award-winning 5-star luxury hotel also has a stylish bar and a Michelin-starred restaurant.
With marble bathrooms and a flat-screen TV, some rooms also have balconies. Executive rooms and suites offer Executive Lounge access with complimentary a continental breakfast, snacks and beverages throughout the day as well as internet access and private check-in.
Boasting some of London’s finest restaurants and bars, the Galvin at Windows serves a menu with an emphasis on British cuisine.
The stylish Podium Restaurant and Bar serves seasonal British cuisine and is famed for the Confessions of a Chocoholic afternoon tea.
The London Hilton on Park Lane also has a business center, a steam room, and a sauna.
London Hilton on Park Lane is undergoing a phased renovation between February and July 2023 to elevate your experience. The hotel will remain open during this period and you can expect the same wonderful care and attention when you visit us. During this time, some services and areas will present temporary changes to normal operations, which will include the lobby and our all-day dining restaurant. Please note that breakfast will be served on our first floor with views overlooking Hyde Park. Please enter the hotel using the back lobby entrance on Hertford Street. Please email reservations.parklane@hilton.com for further information.
This is our guests' favorite part of London, according to independent reviews.
Couples in particular like the location – they rated it 9.3 for a two-person trip.
''')

data = tokenizer(query, return_tensors="pt")

start = time.time()
outputs = model.generate(**data, max_new_tokens=128, num_beams=1, do_sample=False)
end = time.time()

print(tokenizer.decode(outputs[0]))
print(end - start)  # seconds spent in generate() only

I get 12.53 seconds.

If I instead use InferenceAPIClient:

import time

from text_generation import InferenceAPIClient

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
client = InferenceAPIClient(model_name)

start = time.time()
response = client.generate(query, max_new_tokens=128, do_sample=False)  # same wrapped query as above
print(time.time() - start)

This gives 3.96 seconds, which is about 3x faster.

Are there any tricks I can use with my local models to bring them to the same speed as the API client? I'd like to test local models because they've been fine-tuned on additional data.

OlivierDehaene commented 1 year ago

The InferenceAPIClient uses this repository as a backend, which adds multiple optimisations to make the models faster.

If you want to run the models locally, you can use the following command:

model=OpenAssistant/oasst-sft-1-pythia-12b
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus '"device=0"' --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
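
Once the logs show the server is ready, you can smoke-test it over plain HTTP before touching the Python client (a minimal sketch against the /generate route, assuming the 8080:80 port mapping above; the prompt is just a placeholder wrapped in the OpenAssistant template):

# Hypothetical smoke test of the container started above
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"<|prompter|>What is Deep Learning?<|endoftext|><|assistant|>","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'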

You can then prompt the model with:

from text_generation import Client

client = Client("http://localhost:8080")

response = client.generate(query, max_new_tokens=128, do_sample=False)
print(response.generated_text)
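
If you want tokens back as they are produced instead of waiting for the full completion, the same package also exposes a streaming call (a short sketch, assuming the same local server and the query string from the first comment):

from text_generation import Client

client = Client("http://localhost:8080")

# Stream tokens as they are generated and reassemble the final text
text = ""
for response in client.generate_stream(query, max_new_tokens=128, do_sample=False):
    if not response.token.special:
        text += response.token.text
print(text)
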
nd7141 commented 1 year ago

Thank you @OlivierDehaene!

The InferenceAPIClient uses this repository as a backend, which adds multiple optimisations to make the models faster.

I wonder if it's possible to see or know what these optimizations are. Are they model-related (e.g. loading in 8-bit) or hardware-specific (e.g. using tensor parallelism)?

OlivierDehaene commented 1 year ago

You can check out the modelling code for GPT-NeoX here.
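
For the hardware-specific part of the question above, the launcher also exposes tensor parallelism (and quantization) directly as flags, so the model can be sharded across all 8 V100s instead of running on a single GPU. A sketch, assuming the same image and volume as before; run text-generation-launcher --help inside the container to see the exact options your version supports:

model=OpenAssistant/oasst-sft-1-pythia-12b
volume=$PWD/data

# --num-shard splits the model across the GPUs with tensor parallelism
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model --num-shard 8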