AWQ still uses the accelerate library to split across GPUs, and that sort of stinks. Try a different loader like llama.cpp or exllama. You can quantize models to 8-bit for low loss, since you have 4x24 GB and, I assume, good infrastructure.
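As a quick sanity check outside the webui, here is a minimal sketch (my addition, assuming a CUDA-enabled llama-cpp-python build and the model path used later in this thread) that loads the GGUF directly, so you can see what raw llama.cpp speed looks like on this box:

```python
# Sketch: load the GGUF straight with llama-cpp-python (the library
# that the webui's llama.cpp loader wraps) and do one short generation.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q8_0.gguf",  # path assumed from this thread
    n_gpu_layers=-1,   # -1 = offload every layer to the GPUs
    n_ctx=4096,
)
print(llm("Hello,", max_tokens=32)["choices"][0]["text"])
```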
also Windows 11 Pro
For #2, it's going to depend on whether you are talking to it via completions or chat. I think the API exposed is mostly completions, so the client must handle instruct prompts, system messages, etc. Your choice of model is also a base model, so it's not going to be great for chat.
Maybe you should also use an older build? https://github.com/oobabooga/text-generation-webui/issues/4429
@Ph0rk0z I will try a separate loader...
@mykeehu how much older of a build should I try?
Try the https://github.com/oobabooga/text-generation-webui/commit/aab0dd962d9f48e9524f8e602f7252a27ed85870 build, the last one that was very fast for me (a mid-December release). Turn off streaming before using this version!
Ok - How do I turn off streaming? Apologies, I am new to this.
I used the suggested build and figured out how to turn it off... I just tried the model "llama-2-70b.Q8_0.gguf" with llama.cpp and I still get around 3 tokens/second... that seems really wrong given the capacity of the machine...
You should get more than that. Using MMQ (non-tensor-core) llama.cpp, I get about 17-18 tokens/second on fresh context with just 2x3090.
In my case I'm using Linux. You should be able to see in the llama.cpp output whether all 4 GPUs are used. Newer Windows drivers also have a nasty habit of falling back to system RAM when a GPU runs out of memory; make sure to turn that behavior off (the "CUDA - Sysmem Fallback Policy" setting in the NVIDIA Control Panel). On Linux I just open nvtop and see how the model loaded, and I have used up to 5 GPUs. This is why I scoff at Windows 11 Pro: it just makes things much harder, especially as a beginner.
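If it helps, a small sketch (not from the thread; assumes nvidia-smi is on the PATH, which works on both Windows and Linux) for checking that all four cards actually received part of the model:

```python
# Sketch: query per-GPU memory via nvidia-smi to confirm the model
# is really split across all four 3090s rather than sitting on one.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one line per GPU: index, name, used MiB, total MiB
```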
Here is my log file:
17:49:42-888779 INFO Loading llama-2-70b.Q8_0.gguf
17:49:42-951772 INFO llama.cpp weights detected: models\llama-2-70b.Q8_0.gguf
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from models\llama-2-70b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q8_0: 562 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 68.26 GiB (8.50 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
To turn off streaming, set stream: false in the settings-template.yaml file.
Do not use Q8_0 models ("very large, extremely low quality loss - not recommended" per TheBloke); use Q5_K_M models ("large, very low quality loss - recommended"). Download from here:
https://huggingface.co/TheBloke/Llama-2-70B-GGUF
These are good to use because they are not slow and they generate well. I have one 3090 card and use 13B models on it; with 4 cards I don't know how efficient the 70B will be.
llama.cpp used to show you what GPUs it detected and their compute capability. From what it lists, you offloaded all layers, but did you set a tensor split? How did it load onto the GPUs?
@mykeehu Thanks. I'll download this model and give it a shot. It's downloading right now and I will report back with results. Do you think an NVLink would help make the 4 cards more efficient?
@Ph0rk0z The tensor split I set was 24,24,24,24 - it looks like the model was distributed evenly across the VRAM of each card.
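For reference, here is how that split would look if expressed through llama-cpp-python directly (a sketch, continuing the earlier example; the values are relative proportions, so 24,24,24,24 is simply an even four-way split):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q5_K_M.gguf",
    n_gpu_layers=-1,                # offload all layers
    tensor_split=[24, 24, 24, 24],  # relative proportions -> even split over 4 GPUs
    n_ctx=4096,
)
```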
Here is a screenshot of running nvidia-smi, if that is helpful: [nvidia-smi screenshot]
@mykeehu tried with your provided model and disabling streaming in the yaml file -- same results: 2.76 tokens/second :-\
@jbarker7 That's very interesting, because I have a 3090 card and a 13B model produces very nice results on the https://github.com/oobabooga/text-generation-webui/commit/aab0dd962d9f48e9524f8e602f7252a27ed85870 build. I use a Q5_K_M model without streaming and get 20-30 t/s most of the time (see issue #4429). I don't know how the load is distributed across multiple cards, but the 70B should fit comfortably in your VRAM.
So strange... I reinstalled the NVIDIA driver and now it shows all 4 GPUs, but I'm still getting slow results:
11:23:09-647439 INFO Loading llama-2-70b.Q5_K_M.gguf
11:23:09-882572 INFO llama.cpp weights detected: models\llama-2-70b.Q5_K_M.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from models\llama-2-70b.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 17
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
I tried loading Mistral 7B and it's faster, but still much slower than expected. This is all so strange. I just finished putting this box together specifically for this purpose so it's unfortunate that it isn't working as intended.
Try it on Ubuntu, or maybe through WSL.
Although, to be fair:
Output generated in 21.30 seconds (1.03 tokens/s, 22 tokens, context 756, seed 275697167)
doesn't mean much, because you only generated 22 tokens. Turn off the EOS token and generate around 200 tokens to compare.
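For a rough apples-to-apples number, here is a sketch (my addition, not from the thread) that times ~200 tokens with llama-cpp-python; inside the webui the equivalent knobs are max_new_tokens and the "Ban the eos_token" option:

```python
# Sketch: time ~200 generated tokens so very short completions
# don't skew the tokens/second figure.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-70b.Q5_K_M.gguf", n_gpu_layers=-1, n_ctx=4096)

start = time.time()
out = llm("Write a long, detailed story about a ship at sea.\n", max_tokens=200)
elapsed = time.time() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.2f} tokens/s")
```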
I also think that, despite what these lines say:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
I do not get faster speeds with tensor cores for single-batch inference. Still, it's not this bad.
You can try exllamav2 and an EXL2 model. If you still get slow speeds in that case, something is seriously wrong with your config.
This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Discussed in https://github.com/oobabooga/text-generation-webui/discussions/5150