turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Trying to quantize a model for the first time. First stage completes, second stage finishes all layers then gives runtime error CUBLAS_STATUS_EXECUTION_FAILED #304

Closed: longtimegone closed this issue 5 months ago

longtimegone commented 5 months ago

Here is the last bit of the layers finishing and then the error:

-- Linear: model.layers.61.mlp.down_proj -> 0.05:8b_32g/0.1:4b_128g/0.85:3b_128g s4, 3.39 bpw
-- Module quantized, rfn_error: 0.023422
-- Layer: model.norm (RMSNorm)
-- Module quantized, rfn_error: 0.000000
-- Layer: lm_head (Linear)
-- Linear: lm_head -> 0.15:8b_128g/0.85:6b_128g s4, 6.33 bpw
Traceback (most recent call last):
  File "/media/null/models/exllamav2/convert.py", line 250, in <module>
    quant(job, save_job, model)
  File "/home/null/miniconda3/envs/exllama/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^
  File "/media/null/models/exllamav2/conversion/quantize.py", line 334, in quant
    quant_lm_head(job, module, hidden_states, quantizers, cache, attn_params)
  File "/media/null/models/exllamav2/conversion/quantize.py", line 163, in quant_lm_head
    quant_linear(job, module, quantizers["lm_head"], qp.get_dict())
  File "/media/null/models/exllamav2/conversion/quantize.py", line 58, in quant_linear
    lq.quantize(keep_qweight = True, apply = True, drop = drop)
  File "/media/null/models/exllamav2/conversion/adaptivegptq.py", line 374, in quantize
    ext_c.quantize_range(self.quant,
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

I'm assuming I'm doing something wrong here, but I haven't had much luck troubleshooting it.

turboderp commented 5 months ago

Failing to quantize the head layer sounds like there's something strange about the model's vocabulary, maybe an error with how it's being padded. What model is it?

longtimegone commented 5 months ago

It is DarkForest-20B-v1.0

The creator has posted an exl2 quant of it, but only an 8-bit one that is too big for my card, so I assume it worked for them.

longtimegone commented 5 months ago

Just to add some data: I tried booting the PC into Windows and setting everything up from scratch there, and got the same error message, which I have included again below.

Any suggestions about something I might be doing wrong to cause this? Thanks.


-- Layer: model.layers.61 (MLP)
-- Linear: model.layers.61.mlp.gate_proj -> 0.1:6b_32g/0.9:5b_32g s4, 5.23 bpw
-- Linear: model.layers.61.mlp.up_proj -> 0.25:6b_32g/0.75:5b_32g s4, 5.38 bpw
-- Linear: model.layers.61.mlp.down_proj -> 0.05:8b_32g/0.1:6b_32g/0.85:5b_32g s4, 5.38 bpw
-- Module quantized, rfn_error: 0.007601
-- Layer: model.norm (RMSNorm)
-- Module quantized, rfn_error: 0.000000
-- Layer: lm_head (Linear)
-- Linear: lm_head -> 0.15:8b_128g/0.85:6b_128g s4, 6.33 bpw
Traceback (most recent call last):
  File "D:\exllamav2\convert.py", line 250, in <module>
    quant(job, save_job, model)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\exllamav2\conversion\quantize.py", line 334, in quant
    quant_lm_head(job, module, hidden_states, quantizers, cache, attn_params)
  File "D:\exllamav2\conversion\quantize.py", line 163, in quant_lm_head
    quant_linear(job, module, quantizers["lm_head"], qp.get_dict())
  File "D:\exllamav2\conversion\quantize.py", line 58, in quant_linear
    lq.quantize(keep_qweight = True, apply = True, drop = drop)
  File "D:\exllamav2\conversion\adaptivegptq.py", line 374, in quantize
    ext_c.quantize_range(self.quant,
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

turboderp commented 5 months ago

I've investigated now, and the issue is definitely with the model. The config.json specifies a vocab size of 32003, but the head tensor only has 32000 columns. The embedding table also only has 32000 entries. Likely this is because it was merged from two models with different vocabularies, using the embeddings from one and the config from the other. I'll probably add some sanity checks for this kind of error, but honestly the model merging crowd are relentless and I don't really think I can stay ahead of all the problems they keep causing. :shrug:

In any case, this model won't ever inference correctly since it has no way to embed or output the <|im_end|> and <|im_start|> tokens. The [PAD] token (number 32000) could cause issues when batching, too.
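For reference, here is a minimal sketch (not part of exllamav2) of how to check for this kind of mismatch yourself: it compares the vocab_size declared in config.json against the row counts of the embedding and head tensors. It assumes a safetensors checkpoint with the standard Llama tensor names model.embed_tokens.weight and lm_head.weight; the model path is a placeholder.

```python
# Minimal sanity check (sketch): compare the declared vocab size in config.json
# against the actual shapes of the embedding table and the output head.
# Assumes a safetensors checkpoint with standard Llama tensor names; the
# model directory path is a placeholder.
import json
from pathlib import Path

from safetensors import safe_open

model_dir = Path("/path/to/DarkForest-20B-v1.0")  # hypothetical path

config = json.loads((model_dir / "config.json").read_text())
declared_vocab = config["vocab_size"]

# Collect the shapes of the vocab-sized tensors without loading full weights.
shapes = {}
for shard in sorted(model_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            if name in ("model.embed_tokens.weight", "lm_head.weight"):
                shapes[name] = f.get_slice(name).get_shape()

print(f"config.json vocab_size: {declared_vocab}")
for name, shape in shapes.items():
    status = "OK" if shape[0] == declared_vocab else "MISMATCH"
    print(f"{name}: {shape} -> {status}")
```

On a model like the one described above, this would report 32003 from config.json but 32000 rows for both tensors.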

You can still use the model and quantize it to EXL2 if you delete or rename the tokenizer.json and added_tokens.json files and change the "vocab_size" entry in config.json to 32000. You won't be able to use the ChatML prompt formatting, which I think this model was supposed to inherit from Orca (?), but since the relevant embeddings have been stripped out anyway, it's probably not a big loss.
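A rough sketch of that workaround in Python, assuming all the files sit in a single model directory; the path below is a placeholder, and the tokenizer files are renamed rather than deleted so the change is easy to revert.

```python
# Sketch of the workaround described above. The model directory path is a
# placeholder; back up the directory before modifying it.
import json
from pathlib import Path

model_dir = Path("/path/to/DarkForest-20B-v1.0")  # hypothetical path

# Rename (rather than delete) the tokenizer files so the change is reversible.
for name in ("tokenizer.json", "added_tokens.json"):
    f = model_dir / name
    if f.exists():
        f.rename(f.with_name(f.name + ".bak"))

# Patch the declared vocabulary size to match the 32000-row embedding/head tensors.
config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())
config["vocab_size"] = 32000
config_path.write_text(json.dumps(config, indent=2))
```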

longtimegone commented 5 months ago

Thanks, I appreciate the help troubleshooting it.

I tried some other models and everything worked without any problem; as you said, it was the broken model causing the issue.