turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Slower tokens/s than expected #231

Open teknium1 opened 11 months ago

teknium1 commented 11 months ago

Hello, I am running a 2x 4090 PC on Windows, with exllama on 7B Llama-2.

I am only getting ~70-75 t/s during inference (using just 1x 4090), but based on the charts, I should be getting 140+ t/s.

What could be causing this?

turboderp commented 11 months ago

I know on Windows, Hardware-Accelerated GPU Scheduling can make a big difference to performance, so you might try enabling that.
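If it's not obvious whether the setting is on, it can also be read from the registry; here is a small sketch that assumes the commonly documented HwSchMode value (2 = enabled, 1 = disabled):

```
# Sketch: read the Hardware-Accelerated GPU Scheduling state on Windows.
# Assumes the commonly documented HwSchMode registry value; adjust if your
# Windows build stores it elsewhere.
import winreg

key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                     r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers")
try:
    value, _ = winreg.QueryValueEx(key, "HwSchMode")
    print("HAGS enabled" if value == 2 else "HAGS disabled")
except FileNotFoundError:
    print("HwSchMode value not present")
finally:
    winreg.CloseKey(key)
```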

But even without that you should be seeing more t/s on a single 4090. There could be something keeping the GPU occupied or power limited, or maybe your CPU is very slow?
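To rule out power or clock limits, one quick check from Python is via the pynvml bindings (nvidia-ml-py); this is just a diagnostic sketch, nothing ExLlama-specific:

```
# Diagnostic sketch (requires `pip install nvidia-ml-py`): report utilization,
# power draw vs. the enforced limit, and SM clock for each GPU.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)            # % GPU / memory utilization
    power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000           # current draw, W
    limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000   # enforced limit, W
    sm_clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
    print(f"GPU {i}: util {util.gpu}%, power {power:.0f}/{limit:.0f} W, SM clock {sm_clock} MHz")
pynvml.nvmlShutdown()
```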

I recently added the --affinity argument which you could try. It will pin the process to the listed cores, just in case Windows tries to schedule ExLlama on efficiency cores for some reason. E.g. run with --affinity 0,1,2,3,4,5,6,7 or whatever is appropriate for your CPU.
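For reference, a rough equivalent of that pinning can also be done from inside a script with psutil; this is only a sketch of the idea, not ExLlama's actual --affinity implementation:

```
# Rough sketch of core pinning with psutil (`pip install psutil`); on a 13700K
# the first 16 logical CPUs are usually the P-core threads, but check your layout.
import psutil

p = psutil.Process()              # current process
p.cpu_affinity(list(range(16)))   # restrict scheduling to logical CPUs 0-15
print("Affinity:", p.cpu_affinity())
```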

teknium1 commented 11 months ago

Hmm, my CPU shouldn't be slow (13700K), but it may not be using everything it needs to; it seems not to be using all cores: [image]

Do I set --affinity as an arg to any of my inference scripts? It didn't seem to affect the CPU usage or speed much: Time taken for Response: 8.5787 seconds, tokens total: 696, tokens/second: 81.12

teknium1 commented 11 months ago

For reference, my inference code:

```
from model import ExLlama, ExLlamaConfig, ExLlamaCache
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob, time

# Directory containing model, tokenizer, generator

model_directory = "C:\\Teknium\\Models\\StableBeluga-7B-GPTQ\\"   # doubled backslashes for a literal Windows path

# Locate files we need within that directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer and generator

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

#generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5

# Produce a simple generation

prompt = "### Instruction:\nWrite a story about a dog getting out of jail\n### Response:\n"
print (prompt, end = "")
start_time = time.time()
output = generator.generate_simple(prompt, max_new_tokens = 2048)
end_time = time.time()                  # End timing right after generation, before printing
print(output[len(prompt):])
elapsed_time = end_time - start_time    # Time taken for the generation
print(f"Time taken for Response: {elapsed_time:.4f} seconds")
print(f"tokens total: {len(tokenizer.encode(output[len(prompt):]).tolist()[0])}")
```
teknium1 commented 11 months ago

Launching that script with --affinity 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 yields these graphs during inference: [image]

turboderp commented 11 months ago

--affinity would only matter if for some reason the OS scheduler isn't doing its job properly and assigning the process to performance cores, which it should do automatically. The fact that some cores are hitting 100% doesn't mean you're CPU bound, though. PyTorch/CUDA will always do that, no matter what. It doesn't yield available CPU time while synchronizing to the GPU.
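One way to tell the busy-wait apart from being genuinely CPU bound is to time the GPU work with CUDA events and compare it to wall-clock time; a rough diagnostic sketch, not tied to ExLlama's internals:

```
# Sketch: if cuda_ms is close to wall_ms, the GPU is the bottleneck and the
# 100% CPU core is just PyTorch's synchronization busy-wait; a large gap
# suggests time is being lost on the CPU side.
import time
import torch

x = torch.randn(4096, 4096, device = "cuda")
start_evt = torch.cuda.Event(enable_timing = True)
end_evt = torch.cuda.Event(enable_timing = True)

torch.cuda.synchronize()
t0 = time.time()
start_evt.record()
for _ in range(50):
    y = x @ x                      # stand-in for the model's forward pass
end_evt.record()
torch.cuda.synchronize()
wall_ms = (time.time() - t0) * 1000
cuda_ms = start_evt.elapsed_time(end_evt)
print(f"GPU time: {cuda_ms:.1f} ms, wall time: {wall_ms:.1f} ms")
```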

Do you have hardware-accelerated GPU scheduling enabled? And is there anything else using the same GPU, like an animated Windows wallpaper or something? Long shot, I know, but it's worth ruling it out just to be sure.

teknium1 commented 11 months ago

Will --affinity work regardless of whether the script directly implements something to handle it?

I am now getting speeds in line with expectations for multi-GPU 70B inference, about 13.5 t/s on average, and after upgrading to Windows 11 I do get a boost on 7B, from 78 to 86 tok/s, but 7B is still almost 45% slower than it should be. I disabled hardware-accelerated GPU scheduling and will let you know once I've restarted and it is disabled. I will also try isolating to the 2nd GPU that has no display attached and see if the speed is faster. Would I do that by setting device_map to [0,24]?

turboderp commented 11 months ago

Hardware accelerated GPU scheduling should preferably be enabled, not disabled. But idk. Windows is odd sometimes.

To run on just the second GPU, yes, set the device map as you suggest.
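The simplest way to do that is to hide the first GPU from CUDA before anything touches it; a short sketch (the auto-map alternative in the comments is an assumption, check it against the example scripts):

```
# Sketch: force everything onto the second physical GPU by hiding the first one
# from CUDA. This must run before torch / the ExLlama modules are imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # the second 4090 shows up as cuda:0

# ... then build config / model / generator exactly as in the script above.
# Alternatively, with both GPUs visible, passing a split such as "0,24" to
# config.set_auto_map() (no VRAM on device 0, up to 24 GB on device 1) is the
# device_map idea from the question above -- treat the exact string as an
# assumption and verify against the example scripts.
```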

I'm curious though. Have you tried just running the benchmark script, python test_benchmark_inference.py -d <your model dir> -p? It's possible that it's the sampler slowing you down and not the model itself.

turboderp commented 11 months ago

Also, what NVIDIA driver version are you on? Apparently everyone has been seeing a big drop in performance after version 535.something.

teknium1 commented 11 months ago

My driver is 31.0.15.3667 (NVIDIA 536.67).

Will try with benchmark script.

turboderp commented 11 months ago

That's definitely one of the newer drivers that people have been having issues with. You might want to try on 531.x.

teknium1 commented 11 months ago

Will update when I downgrade the drivers and run the benchmark script.

teknium1 commented 11 months ago

Update on the benchmark script (haven't rolled back the driver yet):

```
 -- Tokenizer: C:\Teknium\Models\StableBeluga-7B-GPTQ\tokenizer.model
 -- Model config: C:\Teknium\Models\StableBeluga-7B-GPTQ\config.json
 -- Model: C:\Teknium\Models\StableBeluga-7B-GPTQ\gptq_model-4bit-128g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --sdp_thd: 8
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- Options: ['perf']
 ** Time, Load model: 3.01 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 3,638.47 MB - [cuda:1] 0.00 MB
 ** VRAM, Cache: [cuda:0] 1,024.00 MB - [cuda:1] 0.00 MB
 -- Warmup pass 1...
 ** Time, Warmup: 1.28 seconds
 -- Warmup pass 2...
 ** Time, Warmup: 0.16 seconds
 -- Inference, first pass.
 ** Time, Inference: 0.17 seconds
 ** Speed: 11437.50 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 77.62 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 97.20 tokens/second
 ** VRAM, Inference: [cuda:0] 143.92 MB - [cuda:1] 0.00 MB
 ** VRAM, Total: [cuda:0] 4,806.38 MB - [cuda:1] 0.00 MB
```

77-97 tok/s

turboderp commented 11 months ago

The prompt speed is lower than it should be as well. Kind of suggests the GPU is running slower than it should for some reason.

teknium1 commented 11 months ago

Updated to one driver version newer, 536.99; benchmark speed is slightly lower now. Will revert through the last ~5-10 versions next:
-- Generating 128 tokens, 1920 token prompt... Speed: 76.58 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 91.48 tokens/second

Update: I had disabled the hardware-accelerated GPU scheduling setting a while ago; just turned it back on, and now:
-- Generating 128 tokens, 1920 token prompt... Speed: 100.54 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 119.11 tokens/second
Still on that latest driver; I will now revert through the downgrades until I max it out by driver version. Much closer!

Rolled back to the original driver now, with hardware acceleration on:

Driver: 536.67 w/ Hardware Acceleration
-- Generating 128 tokens, 1920 token prompt... Speed: 104.43 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 130.66 tokens/second

Interesting note here: with hardware accel back on, 70B multi-GPU inference takes a big hit, back down to ~11 tok/s from 16.

Driver Version: 536.40 w/ Hardware Acceleration
-- Generating 128 tokens, 1920 token prompt... Speed: 103.78 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 124.18 tokens/second

Update 2: Will just add each driver version's benchmarks here now for a comprehensive list in one post lol

Driver Version: 536.23
-- Generating 128 tokens, 1920 token prompt... Speed: 102.12 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 115.82 tokens/second

Driver Version: 532.03
Speed: 12213.77 tokens/second (first pass)
-- Generating 128 tokens, 1920 token prompt... Speed: 103.64 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 129.70 tokens/second

Driver Version: 531.79
-- Generating 128 tokens, 1920 token prompt... Speed: 102.45 tokens/second
-- Generating 128 tokens, 4 token prompt... Speed: 127.56 tokens/second