AWQ still uses the accelerate library to split across GPUs, and that sort of stinks. Try a different loader like llama.cpp or exllama. You can quantize models to 8-bit for low loss, since you have 4x24 GB and, I assume, good infrastructure.
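As a quick sanity check outside the webui, here is a minimal sketch (my addition, assuming a CUDA-enabled llama-cpp-python build and the model path used later in this thread) that loads the GGUF directly, so you can see what raw llama.cpp speed looks like on this box:

```python
# Sketch: load the GGUF straight with llama-cpp-python (the library
# that the webui's llama.cpp loader wraps) and do one short generation.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q8_0.gguf",  # path assumed from this thread
    n_gpu_layers=-1,   # -1 = offload every layer to the GPUs
    n_ctx=4096,
)
print(llm("Hello,", max_tokens=32)["choices"][0]["text"])
```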
also Windows 11 Pro
For #2, it's going to depend on whether you are talking to it via completions or chat. I think the API exposed is mostly completions, so the client must handle instruct prompts, system messages, etc. Your choice of model is also a base model, so it's not going to be great for chat.
Maybe you should also use an older build? https://github.com/oobabooga/text-generation-webui/issues/4429
@Ph0rk0z I will try a separate loader...
@mykeehu how much older of a build should I try?
Try the https://github.com/oobabooga/text-generation-webui/commit/aab0dd962d9f48e9524f8e602f7252a27ed85870 build, the last one that was very fast for me (a mid-December release). Turn off streaming before using this version!
Ok - How do I turn off streaming? Apologies, I am new to this.
I used the suggested build and figured out how to turn it off... I just tried the model "llama-2-70b.Q8_0.gguf" with llama.cpp and I still get around 3 tokens/second... that seems really wrong given the capacity of the machine...
You should get more than that. Using MMQ (non-tensor-core) llama.cpp, I get about 17-18 tokens/second on fresh context with just 2x3090.
In my case I'm using Linux. You should be able to see in the llama.cpp output whether all 4 GPUs are used. Newer Windows drivers also have a nasty habit of falling back to system RAM when a GPU runs out of memory; make sure to turn that behavior off (the "CUDA - Sysmem Fallback Policy" setting in the NVIDIA Control Panel). On Linux I just open nvtop and see how the model loaded, and I have used up to 5 GPUs. This is why I scoff at Windows 11 Pro: it just makes things much harder, especially as a beginner.
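If it helps, a small sketch (not from the thread; assumes nvidia-smi is on the PATH, which works on both Windows and Linux) for checking that all four cards actually received part of the model:

```python
# Sketch: query per-GPU memory via nvidia-smi to confirm the model
# is really split across all four 3090s rather than sitting on one.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one line per GPU: index, name, used MiB, total MiB
```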
Here is my log file:
17:49:42-888779 INFO Loading llama-2-70b.Q8_0.gguf
17:49:42-951772 INFO llama.cpp weights detected: models\llama-2-70b.Q8_0.gguf
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from models\llama-2-70b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q8_0: 562 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 68.26 GiB (8.50 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
To turn off streaming, set stream: false in the settings-template.yaml file.
Do not use Q8_0 models ("very large, extremely low quality loss - not recommended" per TheBloke); use Q5_K_M models ("large, very low quality loss - recommended"). Download from here:
https://huggingface.co/TheBloke/Llama-2-70B-GGUF
These are good to use because they are not slow and they generate well. I have one 3090 card and use 13B models on it; with 4 cards I don't know how efficient the 70B will be.
llama.cpp used to show you what GPUs it detected and their compute capability. From what it lists, you offloaded all layers, but did you set a tensor split? How did it load onto the GPUs?
@mykeehu Thanks. I'll download this model and give it a shot. It's downloading right now and I will report back with results. Do you think an NVLink would help make the 4 cards more efficient?
@Ph0rk0z The tensor split I set was 24,24,24,24 - it looks like the model was distributed evenly across the VRAM of each card.
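For reference, here is how that split would look if expressed through llama-cpp-python directly (a sketch, continuing the earlier example; the values are relative proportions, so 24,24,24,24 is simply an even four-way split):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q5_K_M.gguf",
    n_gpu_layers=-1,                # offload all layers
    tensor_split=[24, 24, 24, 24],  # relative proportions -> even split over 4 GPUs
    n_ctx=4096,
)
```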
Here is a screenshot of running nvidia-smi, if that is helpful: [nvidia-smi screenshot]
@mykeehu tried with your provided model and disabling streaming in the yaml file -- same results: 2.76 tokens/second :-\
@jbarker7 That's very interesting, because I have a 3090 card and a 13B model produces very nice results on the https://github.com/oobabooga/text-generation-webui/commit/aab0dd962d9f48e9524f8e602f7252a27ed85870 build. I use a Q5_K_M model without streaming and get 20-30 t/s most of the time (see issue #4429). I don't know how the load is distributed across multiple cards, but the 70B should fit comfortably in your VRAM.
So strange... I reinstalled the NVIDIA driver and now it shows all 4 GPUs, but I'm still getting slow results:
11:23:09-647439 INFO Loading llama-2-70b.Q5_K_M.gguf
11:23:09-882572 INFO llama.cpp weights detected: models\llama-2-70b.Q5_K_M.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from models\llama-2-70b.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 17
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
I tried loading Mistral 7B and it's faster, but still much slower than expected. This is all so strange. I just finished putting this box together specifically for this purpose so it's unfortunate that it isn't working as intended.
Try it on Ubuntu, or maybe through WSL.
Although, to be fair:
Output generated in 21.30 seconds (1.03 tokens/s, 22 tokens, context 756, seed 275697167)
doesn't mean much, because you only generated 22 tokens. Turn off the EOS token and generate around 200 tokens to compare.
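For a rough apples-to-apples number, here is a sketch (my addition, not from the thread) that times ~200 tokens with llama-cpp-python; inside the webui the equivalent knobs are max_new_tokens and the "Ban the eos_token" option:

```python
# Sketch: time ~200 generated tokens so very short completions
# don't skew the tokens/second figure.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-70b.Q5_K_M.gguf", n_gpu_layers=-1, n_ctx=4096)

start = time.time()
out = llm("Write a long, detailed story about a ship at sea.\n", max_tokens=200)
elapsed = time.time() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.2f} tokens/s")
```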
I also think that, despite what these lines say:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
I do not get faster speeds with tensor cores for single-batch inference. Still, it's not this bad.
You can try exllamav2 and an EXL2 model. If you still get slow speeds in that case, something is seriously wrong with your config.
This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Discussed in https://github.com/oobabooga/text-generation-webui/discussions/5150