turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Error trying to quantize cognitivecomputations/dolphin-2.9.1-qwen-110b #453

Closed · bablat closed this issue 3 weeks ago

bablat commented 1 month ago

stdout included:

 -- Beginning new job
 !! Warning: Output directory is not empty: XXX
 !! Cleaning output directory: XXX
 -- Input: cognitivecomputations_dolphin-2.9.1-qwen-110b
 -- Output: tmpdq
 -- Using default calibration dataset
 -- Target bits per weight: 5.6 (decoder), 8 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: cognitivecomputations_dolphin-2.9.1-qwen-110b-5.6bpw-exl2/
 -- Tokenizing samples (measurement)...
Traceback (most recent call last):
  File "/storage/textgen/models/../exllamav2/convert.py", line 209, in <module>
    tokenize(job, save_job, tokenizer, measure = True)
  File "/storage/textgen/exllamav2/conversion/tokenize.py", line 47, in tokenize
    cal_tokens = get_standard_calibration(measure, tokenizer)
  File "/storage/textgen/exllamav2/conversion/tokenize.py", line 94, in get_standard_calibration
    tokenized_articles = [tokenizer.encode(a, add_bos = True, add_eos = True) for a in articles]
  File "/storage/textgen/exllamav2/conversion/tokenize.py", line 94, in <listcomp>
    tokenized_articles = [tokenizer.encode(a, add_bos = True, add_eos = True) for a in articles]
  File "/storage/textgen/exllamav2/exllamav2/tokenizer/tokenizer.py", line 405, in encode
    ids = torch.tensor(ids).to(torch.long).unsqueeze(0)
RuntimeError: Could not infer dtype of NoneType

Help would be appreciated.

turboderp commented 1 month ago

Is this on the latest version? From the stack trace it appears to be fairly old.

bablat commented 1 month ago

I may have mixed up a few different venvs, but I believe this is with 0.0.21:

 -- Beginning new job
 !! Warning: Output directory is not empty: tmpdq
 !! Cleaning output directory: tmpdq
 -- Input: cognitivecomputations_dolphin-2.9.1-qwen-110b
 -- Output: tmpdq
 -- Using default calibration dataset
 -- Target bits per weight: 5.6 (decoder), 8 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: cognitivecomputations_dolphin-2.9.1-qwen-110b-5.6bpw-exl2/
 -- Tokenizing samples (measurement)...
Traceback (most recent call last):
  File "/storage/textgen/models/../exllamav2/convert.py", line 221, in <module>
    tokenize(job, save_job, tokenizer, measure = True)
  File "/storage/textgen/exllamav2/conversion/tokenize.py", line 47, in tokenize
    cal_tokens = get_standard_calibration(job, measure, tokenizer)
  File "/storage/textgen/exllamav2/conversion/tokenize.py", line 96, in get_standard_calibration
    tokenized_articles = [tokenizer.encode(a, add_bos = True, add_eos = True) for a in articles]
  File "/storage/textgen/exllamav2/conversion/tokenize.py", line 96, in <listcomp>
    tokenized_articles = [tokenizer.encode(a, add_bos = True, add_eos = True) for a in articles]
  File "/storage/textgen/exllamav2/exllamav2/tokenizer/tokenizer.py", line 418, in encode
    ids = torch.tensor(ids).to(torch.long).unsqueeze(0)
RuntimeError: Could not infer dtype of NoneType

yamosin commented 1 month ago

I get the same issue with 0.0.21; with 0.0.20 it reports a Hessian error.

turboderp commented 1 month ago

So I looked into it, and the issue is that from 0.0.21 ExLlama no longer uses token ID 0 as a fallback when the model doesn't define a BOS token. Simple fix is to add "bos_token_id": 151644, to config.json, and it should start fine.

The Hessian error I'm assuming is because it runs out of memory. Qwen-110B is simply too big to quantize on a 24 GB GPU.
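
For reference, a minimal sketch of applying that edit programmatically; the model directory path here is just an example, and editing config.json by hand works equally well:

```python
import json

# Example path to the downloaded model directory (adjust to your setup).
config_path = "cognitivecomputations_dolphin-2.9.1-qwen-110b/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# Set the BOS token ID suggested above so the tokenizer no longer returns None for it.
config["bos_token_id"] = 151644

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```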

yamosin commented 1 month ago

> So I looked into it, and the issue is that from 0.0.21 ExLlama no longer uses token ID 0 as a fallback when the model doesn't define a BOS token. Simple fix is to add "bos_token_id": 151644, to config.json, and it should start fine.
>
> The Hessian error I'm assuming is because it runs out of memory. Qwen-110B is simply too big to quantize on a 24 GB GPU.

Adding "bos_token_id": 151644 to config.json fixes the RuntimeError: Could not infer dtype of NoneType, but I still get the Hessian error. I'm running on 4x3090, so does it need more than 24 GB on a single card? I can quantize Command R+, and that spills over to a second card when the first runs low on VRAM during quantization, but dolphin-qwen gives the Hessian error:

---------------------------------------------
| Measured: model.layers.0 (Attention)      |
| Duration: 103.03 seconds                  |
| Completed step: 1/163                     |
| Avg time / step (rolling): 103.03 seconds |
| Estimated remaining time: 278min 10sec    |
| Last checkpoint layer: None               |
---------------------------------------------
 -- Layer: model.layers.0 (MLP)
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
Traceback (most recent call last):
  File "e:\exllamav2\conversion\adaptivegptq.py", line 292, in prepare
    hessian_inv = torch.linalg.cholesky(hessian)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 45058 is not positive-definite).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\exllamav2\convert.py", line 240, in <module>
    status = measure_quant(job, save_job, model)  # capturing the graceful exits
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\tabbyAPI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "e:\exllamav2\conversion\measure.py", line 563, in measure_quant
    m = measure_mlp(module, hidden_states, target_states, quantizers, cache, attn_params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\exllamav2\conversion\measure.py", line 204, in measure_mlp
    quantizers["down_proj"].prepare()
  File "e:\exllamav2\conversion\adaptivegptq.py", line 330, in prepare
    raise ValueError("Hessian is not invertible")
ValueError: Hessian is not invertible
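
For context on the repeated "Applied additional damping" warnings: when the Cholesky factorization fails because the Hessian is not positive-definite, the quantizer adds a small value to the Hessian's diagonal and retries, giving up after a number of attempts. A rough standalone illustration of that pattern (not exllamav2's actual code; the damping schedule is invented for the example):

```python
import torch

def cholesky_with_damping(hessian: torch.Tensor, damp: float = 0.01, max_attempts: int = 10):
    # Scale the damping term relative to the average diagonal magnitude.
    diag_mean = hessian.diagonal().mean()
    h = hessian.clone()
    for _ in range(max_attempts):
        try:
            # Succeeds only if h is positive-definite.
            return torch.linalg.cholesky(h)
        except RuntimeError:
            # torch raises a LinAlgError (a RuntimeError subclass) when the
            # factorization fails; add damping to the diagonal and retry.
            print(" !! Warning: Applied additional damping")
            h += torch.eye(h.shape[0], dtype=h.dtype, device=h.device) * diag_mean * damp
            damp *= 2.0
    raise ValueError("Hessian is not invertible")
```
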
waterangel91 commented 1 month ago

Might not be related to this, but is there any function/setting I can use to set the EOS token to a list of tokens/IDs? Currently I use the stream method and manually validate the output tokens.
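
For illustration, the kind of manual validation described here looks roughly like the sketch below; stream_next_token() is a hypothetical placeholder for whichever streaming call is actually in use, and the stop-token IDs are examples for a Qwen-style tokenizer:

```python
# Hypothetical manual stop-token check around a streaming loop.
# stream_next_token() is a placeholder, not an exllamav2 API.
stop_token_ids = {151643, 151645}  # e.g. <|endoftext|> and <|im_end|> in Qwen-style models

output_ids = []
while True:
    token_id, text_chunk = stream_next_token()  # placeholder for the real streaming call
    if token_id in stop_token_ids:
        break
    output_ids.append(token_id)
    print(text_chunk, end="", flush=True)
```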

bablat commented 3 weeks ago

I just encountered the same issue with today's new Qwens; the bos_token_id fix in config.json works, so I'll close this issue. Thanks for your huge contribution to the community, @turboderp.