Closed · longtimegone closed this issue 5 months ago
Failing to quantize the head layer sounds like there's something strange about the model's vocabulary, maybe an error with how it's being padded. What model is it?
It is DarkForest-20B-v1.0
The creator has posted an exl2 quant of it, but only an 8-bit one that is too big for my card, so I assume it worked for them.
Just to add some data: I tried booting the PC into Windows and setting everything up from scratch there, and got the same error message, which I've included again below.
Any suggestions about something I might be doing wrong to cause this? Thanks.
-- Layer: model.layers.61 (MLP)
-- Linear: model.layers.61.mlp.gate_proj -> 0.1:6b_32g/0.9:5b_32g s4, 5.23 bpw
-- Linear: model.layers.61.mlp.up_proj -> 0.25:6b_32g/0.75:5b_32g s4, 5.38 bpw
-- Linear: model.layers.61.mlp.down_proj -> 0.05:8b_32g/0.1:6b_32g/0.85:5b_32g s4, 5.38 bpw
-- Module quantized, rfn_error: 0.007601
-- Layer: model.norm (RMSNorm)
-- Module quantized, rfn_error: 0.000000
-- Layer: lm_head (Linear)
-- Linear: lm_head -> 0.15:8b_128g/0.85:6b_128g s4, 6.33 bpw
Traceback (most recent call last):
File "D:\exllamav2\convert.py", line 250, in
cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
I've investigated now, and the issue is definitely with the model. The config.json specifies a vocab size of 32003, but the head tensor only has 32000 columns. The embedding table also only has 32000 entries. Likely this is because it was merged from two models with different vocabularies, using the embeddings from one and the config from the other.

I'll probably add some sanity checks for this kind of error, but honestly the model merging crowd are relentless and I don't really think I can stay ahead of all the problems they keep causing. :shrug:
In any case, this model won't ever inference correctly, since it has no way to embed or output the <|im_end|> and <|im_start|> tokens. The [PAD] token (number 32000) could cause issues when batching, too.
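A quick way to catch this kind of mismatch before quantizing is to compare the configured vocab size against the row counts of the actual tensors. A minimal sketch (the helper name is made up for illustration, and the 32003/32000 numbers are the ones from this model):

```python
def check_vocab_consistency(config_vocab_size, embed_rows, lm_head_rows):
    """Return a list of mismatch descriptions (empty if everything agrees)."""
    problems = []
    if embed_rows != config_vocab_size:
        problems.append(
            f"embed_tokens has {embed_rows} rows, config says {config_vocab_size}"
        )
    if lm_head_rows != config_vocab_size:
        problems.append(
            f"lm_head has {lm_head_rows} rows, config says {config_vocab_size}"
        )
    return problems

# The numbers from this model: config claims 32003, both tensors have 32000 rows
for p in check_vocab_consistency(32003, 32000, 32000):
    print(p)
```

In practice you would read `vocab_size` out of config.json and the tensor shapes from the safetensors headers, but the comparison itself is this simple.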
You can still use the model and quantize to EXL2 if you delete/rename the tokenizer.json and added_tokens.json files while changing the "vocab_size" entry in config.json to 32000. You won't be able to use the ChatML prompt formatting which I'm thinking this model was supposed to inherit from Orca (?), but since the relevant embeddings have been stripped out anyway, probably not a big loss.
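Those manual steps can also be scripted. A sketch of the workaround above, assuming a placeholder model directory (this is not part of convert.py; it just sidelines the added-token files and patches config.json):

```python
import json
from pathlib import Path

def patch_model_dir(model_dir, new_vocab_size=32000):
    """Apply the workaround: sideline the added-token files and shrink vocab_size."""
    model_dir = Path(model_dir)

    # Rename the files that carry the extra tokens so the loader falls back
    # to the base 32000-entry vocabulary.
    for name in ("tokenizer.json", "added_tokens.json"):
        f = model_dir / name
        if f.exists():
            f.rename(f.with_name(name + ".bak"))

    # Point config.json at the vocabulary size the tensors actually have.
    config_path = model_dir / "config.json"
    config = json.loads(config_path.read_text())
    config["vocab_size"] = new_vocab_size
    config_path.write_text(json.dumps(config, indent=2))

# patch_model_dir("/path/to/DarkForest-20B-v1.0")  # placeholder path
```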
Thanks, I appreciate the help troubleshooting it.
I tried some other models and everything worked without any problem; as you said, it was the broken model causing the issue.
Here is the last bit of the layers finishing and then the error:
-- Linear: model.layers.61.mlp.down_proj -> 0.05:8b_32g/0.1:4b_128g/0.85:3b_128g s4, 3.39 bpw
-- Module quantized, rfn_error: 0.023422
-- Layer: model.norm (RMSNorm)
-- Module quantized, rfn_error: 0.000000
-- Layer: lm_head (Linear)
-- Linear: lm_head -> 0.15:8b_128g/0.85:6b_128g s4, 6.33 bpw
Traceback (most recent call last):
File "/media/null/models/exllamav2/convert.py", line 250, in
quant(job, save_job, model)
File "/home/null/miniconda3/envs/exllama/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/null/models/exllamav2/conversion/quantize.py", line 334, in quant
quant_lm_head(job, module, hidden_states, quantizers, cache, attn_params)
File "/media/null/models/exllamav2/conversion/quantize.py", line 163, in quant_lm_head
quant_linear(job, module, quantizers["lm_head"], qp.get_dict())
File "/media/null/models/exllamav2/conversion/quantize.py", line 58, in quant_linear
lq.quantize(keep_qweight = True, apply = True, drop = drop)
File "/media/null/models/exllamav2/conversion/adaptivegptq.py", line 374, in quantize
ext_c.quantize_range(self.quant,
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling
cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
I'm assuming I'm doing something wrong here, but I haven't had much luck troubleshooting it.
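For what it's worth, a CUBLAS_STATUS_EXECUTION_FAILED at the lm_head step is consistent with the GEMM being handed mismatched dimensions. This is only a toy NumPy illustration of that kind of mismatch (the 32003/32000 sizes stand in for the config's vocab size vs. the tensor's real column count), not what the CUDA kernel actually does:

```python
import numpy as np

a = np.zeros((4, 32003), dtype=np.float32)   # shaped from the configured vocab size
b = np.zeros((32000, 8), dtype=np.float32)   # shaped from the tensor's real size

# NumPy checks the inner dimensions and raises cleanly; an unchecked GPU GEMM
# with the same mismatch can surface as an opaque cuBLAS execution failure.
try:
    a @ b
except ValueError as e:
    print("shape mismatch:", e)
```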