huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

How to make sure the local tgi server's performance is ok #720

Closed: lichangW closed this issue 5 months ago

lichangW commented 1 year ago

Feature request

Hello, I just deployed the TGI server in a Docker container on a single A100 following the docs and ran a load test with bloom-7b1, but the performance has come a long way from other inference servers, like vLLM and FasterTransformer, under the same environment and conditions. Is there something like an official performance table that a beginner like me can use to confirm the performance is OK, or detailed instructions for checking and tuning options to improve throughput? Thanks a lot!
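For concreteness, here is a minimal sketch of the docs-style Docker deployment being described, assuming the standard ghcr.io image; the image tag, port mapping and volume path are placeholders rather than the exact command that was used:

```shell
# Minimal sketch of a single-GPU TGI deployment following the docs.
# The image tag is a placeholder; use the release you are actually testing.
model=bigscience/bloom-7b1
volume=$PWD/data   # cache model weights across container restarts

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $model
```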

Motivation

None

Your contribution

None

Narsil commented 1 year ago

"has come a long way from other inference servers"

What do you mean? Is it faster or slower? I'm guessing slower, but the phrasing isn't clear to me.

Usually, running text-generation-benchmark --tokenizer-name xxxx is our way of checking a given deployment (a sketch of such an invocation follows this comment).

What kind of numbers are you seeing? How are you testing?

Note: benchmarking in general is hard, and it's easy to reach the wrong conclusion if you don't understand what's going on.
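A sketch of such a benchmark invocation, run inside the serving container next to an already-launched server for the same model. --tokenizer-name, --sequence-length and --decode-length are the flags mentioned in this thread; the batch-size sweep is an assumption about the tool's options, and the values are illustrative:

```shell
# Illustrative text-generation-benchmark run (inside the TGI container,
# with the server for the same model already running).
text-generation-benchmark \
  --tokenizer-name bigscience/bloom-7b1 \
  --sequence-length 512 \
  --decode-length 128 \
  --batch-size 1 --batch-size 8 --batch-size 32  # assumed repeatable flag for sweeping batch sizes
```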

lichangW commented 1 year ago

Thanks for the reply. Yes, it's much slower than the others when testing on a single A100 environment with the same dataset and load-test script. I also tested with text-generation-benchmark --tokenizer-name bigscience/bloom-7b1:

[benchmark output screenshot]

And with text-generation-benchmark --tokenizer-name bigscience/bloom-7b1 --decode-length 1024:

[benchmark output screenshot]

Please share any suggestions you have; thanks in advance!

ZhaiFeiyue commented 1 year ago

@Narsil Does text-generation-benchmark also test the continuous batching mentioned in the router, given that we can only set Decode Length and Sequence Length?

Narsil commented 1 year ago

It doesn't test it per se, since when continuous batching is active many things can be happening at the same time.

But every performance number is dominated by the number of tokens in the decode phase, so this is really what you should be looking at.
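As a rough illustration of that point (an assumed latency model with illustrative numbers, not measured values): if prefill takes $T_{\text{prefill}}$ and each decode step costs roughly a constant $t_{\text{decode}}$, then generating $n$ tokens takes about

$$T \approx T_{\text{prefill}} + n \cdot t_{\text{decode}}$$

so at an assumed $t_{\text{decode}}$ of ~20 ms, a 1024-token decode is on the order of 20 seconds per request, which is why the decode length dominates the numbers regardless of how the scheduler batches requests.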

ZhaiFeiyue commented 1 year ago

@Narsil thanks

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.