hamelsmu / llama-inference

experiments with inference on llama

Incorrect results for exllama #2

Closed arbi-dev closed 1 year ago

arbi-dev commented 1 year ago

This is a nice tool, but I believe the exllama figures are incorrect: there is an issue in the benchmarking which, once corrected, shows that exllama is almost twice as fast as ctranslate2 with this particular model (llama2-7b). Here is what I noticed:

  1. Running the benchmark code on an RTX 4090 indeed shows about 2.4 seconds per 200 tokens for ctranslate2 and about 6.5 seconds for exllama (using the text-gen-webui server with the OpenAI-API-compatible endpoint).

  2. However, at the same time, the text-gen-webui server itself reported a very different figure for exllama: about 1.4 seconds per 200 tokens.

  3. I hypothesized that the additional latency comes from something in the exllama bench.py, probably the tqdm progress bar.

  4. To test the hypothesis, I added a simple time counter around the start and end of each generation in both the exllama and ctranslate2 scripts (see the sketch after this list).

  5. The results (copied below) showed that the server figures above were correct and that exllama is actually significantly faster than ctranslate2. Of course, another big benefit of exllama is that it allows 4-bit quantization, which halves the VRAM requirements! [Screenshot of timing results, 2023-07-30]
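For illustration, here is a minimal sketch of the kind of per-generation timer described in point 4: it times a single request to an OpenAI-compatible completions endpoint with `time.perf_counter`, with no progress bar involved. The URL, payload fields, and prompt are assumptions for the example and are not copied from the actual bench.py.

```python
import time
import requests

# Hypothetical endpoint/payload: adjust to match the local text-gen-webui
# OpenAI-compatible API; these values are assumptions, not from bench.py.
URL = "http://127.0.0.1:5000/v1/completions"

def timed_generation(prompt: str, max_tokens: int = 200) -> float:
    """Time one generation request end to end."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    start = time.perf_counter()            # counter starts right before the request
    resp = requests.post(URL, json=payload, timeout=300)
    elapsed = time.perf_counter() - start  # counter stops right after the response
    resp.raise_for_status()
    return elapsed

if __name__ == "__main__":
    for i in range(3):
        print(f"run {i}: {timed_generation('Write a short story about a robot.'):.2f} s")
```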

hamelsmu commented 1 year ago

Thanks for opening this. I updated to the new textgen-webui and indeed it's much faster. I didn't find that tqdm slowed anything down for me, though. I will update this. Also, I couldn't get ggml working in a fresh installation of textgen-webui.

Updating the post shortly

hamelsmu commented 1 year ago

(I'm also going to remove tqdm just in case; I do notice a slight slowdown when doing an A/B test between the two.)
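As a rough illustration of that kind of A/B check, the sketch below times the same loop with and without a tqdm wrapper; the `fake_generate` stand-in is hypothetical and should be replaced with the real generation call.

```python
import time
from tqdm import tqdm

def fake_generate():
    # Stand-in for a single generation call; replace with the real model call.
    time.sleep(0.01)

def run(iterable) -> float:
    """Run the same work over an iterable and return wall-clock seconds."""
    start = time.perf_counter()
    for _ in iterable:
        fake_generate()
    return time.perf_counter() - start

n = 100
plain = run(range(n))            # A: bare loop
with_bar = run(tqdm(range(n)))   # B: identical loop wrapped in a tqdm progress bar
print(f"plain: {plain:.3f} s  |  tqdm: {with_bar:.3f} s")
```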

hamelsmu commented 1 year ago

I updated the post. Thank you!