turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Performance drops with longer prompts? #356

Closed qss-uzair closed 2 weeks ago

qss-uzair commented 4 months ago

I'm running the streaming.py example on an NVIDIA Jetson AGX Orin, using some version of Mistral 7B. With the original prompt I get 31 t/s, but with a longer prompt (~1000 tokens) the performance degrades to 20 t/s.

Is it normal / expected?

qss-uzair commented 4 months ago

Moreover, I found something unusual:

Prompt processed in 0.10 seconds, 1441 tokens, 14982.14 tokens/second
Response generated in 14.85 seconds, 250 tokens, 16.83 tokens/second

I've tried replacing time.time() with time.perf_counter(), with the same result.

CyberTimon commented 4 months ago

Both are normal. First, prompt processing speed is hard to measure accurately when the prompt is short and processed almost instantly (1441 tokens in your case is pretty short; try something like 15k and you will see the real speed). Second, it's completely usual that generation gets slower as the sequence grows, because computing each new token takes more compute the longer the text it has to attend over.
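
The second point can be seen in isolation with a small, hedged PyTorch sketch (this is not exllamav2 code; the head count and head dimension are illustrative assumptions): it times a single-token attention step against progressively longer KV caches, which is the part of decoding whose cost grows with context.

```python
import time
import torch
import torch.nn.functional as F

# Editorial illustration (not exllamav2 code): the cost of attending for one
# new token grows with context length, because each step must read the whole
# KV cache. Head count and head size below are illustrative assumptions.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
n_heads, head_dim = 32, 128

for kv_len in (128, 1024, 4096, 8192):
    # One new query token attending over kv_len cached keys/values
    q = torch.randn(1, n_heads, 1, head_dim, device=device, dtype=dtype)
    k = torch.randn(1, n_heads, kv_len, head_dim, device=device, dtype=dtype)
    v = torch.randn_like(k)

    F.scaled_dot_product_attention(q, k, v)        # warm-up
    if device == "cuda":
        torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(100):
        F.scaled_dot_product_attention(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()

    ms_per_call = (time.perf_counter() - t0) * 1000 / 100
    print(f"kv_len={kv_len:5d}: {ms_per_call:.3f} ms per single-token attention call")
```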

turboderp commented 4 months ago

It does actually look like there's a small bug in the streaming example. It needs a torch.cuda.synchronize() after generator.begin_stream to measure the prompt speed correctly. Otherwise begin_stream returns as soon as the CUDA work is queued, while the prompt pass is still running on the GPU, and that remaining work is incorrectly counted as latency for the first token.

Generation speed dropping with longer context is normal, though. There's simply more work to do the more context you have to attend to. Flash Attention mitigates this a bit. Not sure if you can run that on the Jetson.
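
A hedged sketch of the timing fix described above, written against the structure of the streaming.py example (generator, input_ids and settings stand for the objects the example already sets up; the exact argument list of begin_stream is assumed and may differ between versions):

```python
import time
import torch

# Sketch of the measurement fix: begin_stream() only enqueues CUDA work, so
# synchronize before stopping the clock; otherwise the unfinished prompt pass
# is billed to the first generated token instead.
time_begin = time.time()

generator.begin_stream(input_ids, settings)  # objects from the example's setup (assumed signature)
torch.cuda.synchronize()                     # wait for the prompt pass to actually finish

prompt_time = time.time() - time_begin
prompt_tokens = input_ids.shape[-1]
print(f"Prompt processed in {prompt_time:.2f} seconds, "
      f"{prompt_tokens} tokens, {prompt_tokens / prompt_time:.2f} tokens/second")
```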