turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support C4AI Command-R+ #400

Closed: alexbrowngh closed this issue 1 month ago

alexbrowngh commented 2 months ago

I was wondering if you might consider adding support for the recently released Command-R+ model. From what I understand, it appears to be one of the most advanced and capable models currently available in the open-source community. I would greatly appreciate it if you could look into this possibility. Thank you very much for your time and consideration.

turboderp commented 2 months ago

I'm already working on it.

image

It's a big model though, and making sure everything works on (some number of) 24 GB GPUs is tricky. But the architecture is not too different from Command-R, which is already supported. The main difference is the normalization of the Q and K heads during attention, which is trivial.
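For reference, a rough sketch of what that Q/K normalization amounts to. This is not exllamav2's actual implementation; treating it as a per-head LayerNorm applied to the query and key projections before attention is an assumption based on the published architecture, and the tensor shapes are illustrative.

import torch.nn.functional as F

def attn_with_qk_norm(q, k, v, q_norm_w, k_norm_w, eps=1e-5):
    # q, k, v: (batch, heads, seq_len, head_dim); q_norm_w / k_norm_w: (head_dim,)
    # Normalize Q and K per head, then run ordinary scaled-dot-product attention.
    q = F.layer_norm(q, (q.shape[-1],), weight=q_norm_w, eps=eps)
    k = F.layer_norm(k, (k.shape[-1],), weight=k_norm_w, eps=eps)
    return F.scaled_dot_product_attention(q, k, v)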

Technically it loads at the moment (and seems to work) with load_in_q4, but that takes all of my 3x24 GB of VRAM and leaves no room for context. So as advanced and capable as the model may be, you'll need some advanced and capable hardware too.

Ph0rk0z commented 2 months ago

There were posts on both LMG and Reddit saying they were getting a segfault when quanting. Here's hoping for a 5-bit; it has the potential to unseat midnight miqu for chats, as long as the alignment doesn't override the system prompt.

atisharma commented 2 months ago

I'm willing to test. I got a segfault when trying to quant. I tried on an A6000 and a 4090.

Ph0rk0z commented 2 months ago

I'm guessing the GPTQ version from alpinedale doesn't run, even though GPTQ models are supported?

turboderp commented 2 months ago

The dev branch should now support cmdr+. I've uploaded some quants here. Note that if you want to make more, you'll need at least 2x24 GB of VRAM to quantize it. 1x48 GB should also work.

I'll do a bit more testing and maybe some optimization before merging and releasing.

Omegastick commented 2 months ago

I've confirmed the 5.0bpw quant loads on an 80 GB A100 with 128k context and 4-bit cache.

2024-04-06T07:41:19.444303236-05:00 Output generated in 50.31 seconds (0.99 tokens/s, 50 tokens, context 43483, seed 859447966)
2024-04-06T08:01:59.375193653-05:00 Output generated in 61.04 seconds (0.82 tokens/s, 50 tokens, context 113592, seed 1204682100)

It takes ~15 minutes to process the 113,000 token prompt.
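For anyone wanting to reproduce this outside of a front-end, a minimal loading sketch along the lines of the exllamav2 examples; the model path is a placeholder and the class names reflect my understanding of the dev branch, so treat it as a sketch rather than gospel:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/c4ai-command-r-plus-5.0bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 131072                  # 128k context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # Q4 KV cache is what makes 128k fit
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)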

Looks like Ooba doesn't support the tokenizer_config.json format used here yet. I had to change these lines to be a single string rather than a list.

"chat_template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}"

atisharma commented 2 months ago

I'm unable to quantize; it fails at the last minute:

 -- Module quantized, calibration perplexity (quant): 7.1733
 -- Saving checkpoint...
 -- Compiling output file...
 -- Writing shard 1...
 -- Writing shard 2...
 -- Writing shard 3...
 -- Writing shard 4...
 -- Writing shard 5...
 -- Writing shard 6...
 -- Writing shard 7...
 -- Writing shard 8...
 -- Saved model weights:
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00001-of-00008.safetensors (8,112 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00002-of-00008.safetensors (8,119 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00003-of-00008.safetensors (8,070 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00004-of-00008.safetensors (8,167 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00005-of-00008.safetensors (8,112 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00006-of-00008.safetensors (8,172 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00007-of-00008.safetensors (8,174 MB)
 --   /srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/output-00008-of-00008.safetensors (2,497 MB)
Traceback (most recent call last):
  File "/user/exllamav2/convert.py", line 272, in <module>
    compile_model(job, save_job, model)
  File "/opt/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/user/exllamav2/conversion/compile.py", line 231, in compile_model
    with open(config_json, "r") as f:
         ^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/srv/models/agalmic/c4ai-command-r-plus-6.0bpw-h6-exl2/config.json'

turboderp commented 2 months ago

@atisharma That seems to be a bug. Try the latest commit.

Also, the conversion should actually be complete at the point where it failed, with all the files already written to the output directory.

Ph0rk0z commented 2 months ago

Why is it so much larger than same-size frankenmerges? Is it the extra vocabulary?

turboderp commented 2 months ago

Vocabulary adds a lot, yes. But of the weights in the safetensors, about 6 GB are only ever loaded to system RAM, so subtract that. And since the head layer is quantized separately (usually at 6 bpw), it accounts for a little more space than normal, given that it's 3B parameters vs. 0.5B in Goliath etc.
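Back-of-the-envelope, assuming the published Command-R+ dimensions (256k vocabulary, 12288 hidden size):

vocab, hidden = 256_000, 12_288
head_params = vocab * hidden                  # ~3.1B parameters in the head alone
head_gb_at_6bpw = head_params * 6 / 8 / 1e9   # ~2.4 GB at 6 bits per weight
print(f"{head_params / 1e9:.1f}B head params, ~{head_gb_at_6bpw:.1f} GB at 6 bpw")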

atisharma commented 2 months ago

@atisharma That seems to be a bug. Try the latest commit.

Thanks. That ran fine. I have a related question: how did you generate tokenizer_config.json? Did you construct it manually?

turboderp commented 2 months ago

@atisharma I found a PR on the model page: here

Ph0rk0z commented 2 months ago

So I can subtract 6 GB from the file size and run the 4.5bpw? I was worried it only left ~5 GB open for context, since the file size was 67 GB. If it's effectively 61 GB, it may be a different story.

turboderp commented 2 months ago

Whether it fits in 3x24 GB I'm not sure. That's my setup here too, and 4.0bpw pretty much maxes it out at 32k context.

mmealman commented 2 months ago

Really nice work. I can fit 11k context with the 3bpw quant on dual 3090s, at 15.5 tokens per second:

python test_inference.py -l 11264 -gs 22,24 -t 512 -eq4 -m models/turboderp_command-r-plus-103B-exl2_3.0bpw -p "Once upon a time,"

Ph0rk0z commented 2 months ago

Well, it loads and generates correctly. I still have to check the template to make sure it's right, etc. 4.0 did about 23/22/17 with 16K, but in textgen it was detected as only an 8k-context model.

ptb_new at 2048 is 8.0, so that is normal. The model is replying okay, but I have to test more back-and-forth dialogue. It's not as verbose or flowery as the API, but that is probably down to my settings. Overall it appears to perform similarly to midnight-miqu 103b while running faster. Saw 13 t/s over 3x24 GB with about 3k of context.

mmealman commented 2 months ago

It's easy to work with in ExUI. Just remove exllamav2 from requirements.txt and manually install it instead:

pip install git+https://github.com/turboderp/exllamav2.git@dev
pip install tokenizers

When I forget to reset the chat template to the Cohere chat prompt in ExUI, the model has issues, so it does appear to be sensitive to the prompt (where Miqu isn't). It feels like a smart model at 3.0bpw after chatting with it here and there. No blow-ups in short-context conversations, though the model feels oddly precise in how it operates. Perhaps because it's aimed at RAG and enterprise use.

Screenshot_20240406_212303

atisharma commented 2 months ago

Works well with tabbyAPI.

yamosin commented 2 months ago

I got an error when quantizing this model:

 -- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: d:\command-r-plus
 -- Output: d:\cmdr
 -- Calibration dataset: ruozhiba.parquet, 32 / 16 rows, 2048 tokens per sample
 -- Target bits per weight: 4.5 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Compiling output file...
 -- Writing shard 1...
Traceback (most recent call last):
  File "E:\exllamav2\convert.py", line 272, in <module>
    compile_model(job, save_job, model)
  File "e:\miniconda3\envs\tb\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "E:\exllamav2\conversion\compile.py", line 157, in compile_model
    save_file(save_dict, out_filename)
  File "e:\miniconda3\envs\tb\Lib\site-packages\safetensors\torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
                   ^^^^^^^^^^^^^^^^^
  File "e:\miniconda3\envs\tb\Lib\site-packages\safetensors\torch.py", line 485, in _flatten
    return {
           ^
  File "e:\miniconda3\envs\tb\Lib\site-packages\safetensors\torch.py", line 489, in <dictcomp>
    "data": _tobytes(v, k),
            ^^^^^^^^^^^^^^
  File "e:\miniconda3\envs\tb\Lib\site-packages\safetensors\torch.py", line 428, in _tobytes
    data = np.ctypeslib.as_array(newptr, (total_bytes,))  # no internal copy
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\miniconda3\envs\tb\Lib\site-packages\numpy\ctypeslib.py", line 521, in as_array
    p_arr_type = ctypes.POINTER(_ctype_ndarray(obj._type_, shape))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\miniconda3\envs\tb\Lib\site-packages\numpy\ctypeslib.py", line 354, in _ctype_ndarray
    element_type = dim * element_type
                   ~~~~^~~~~~~~~~~~~~
ValueError: Array length must be >= 0, not -2298478592

I found a similar issue here: https://github.com/turboderp/exllamav2/issues/152#issuecomment-1831205252. But I need to specify -l, because I get the error below if I don't set -l under 50 when starting the quant:

 ** Warning: Not enough sample data in ruozhiba.parquet
Traceback (most recent call last):
  File "E:\exllamav2\convert.py", line 252, in <module>
    tokenize(job, save_job, tokenizer)
  File "E:\exllamav2\conversion\tokenize.py", line 45, in tokenize
    cal_tokens = get_tokens(rows, length, cal_ds, tokenizer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\exllamav2\conversion\tokenize.py", line 25, in get_tokens
    all_tokens = all_tokens.view((num_rows, length))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[100, 2048]' is invalid for input of size 122366

With -l 32 the quant itself now completes, but it reports the error above when saving the file.

turboderp commented 2 months ago

ValueError: Array length must be >= 0, not -2298478592

This is a known issue with the safetensors library. I raised it here but it didn't get resolved, I think, and then it was closed as stale. So idk how to proceed, really. :shrug: It appears to be a bug in numpy that only manifests on Windows.

turboderp commented 2 months ago

The other error, I think, is because the calibration dataset you're using doesn't produce enough tokens for the default of 100 rows of 2048 tokens; it only yields 122,366 tokens, and 122,366 / 2048 is about 59 full rows, so -r 59 should work in that case. But it's a pretty small calibration set, and in general I don't recommend using a custom dataset over the built-in one.
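For example, something along these lines should get past the tokenizing step with that dataset; the paths are placeholders, and it's worth double-checking the flag names against the current convert.py:

python convert.py -i /path/to/command-r-plus -o /path/to/job_dir -c ruozhiba.parquet -r 59 -b 4.5 -hb 6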

yamosin commented 2 months ago

ValueError: Array length must be >= 0, not -2298478592

This is a known issue with the safetensors library. I raised it here but it didn't get resolved, I think, and then it was closed as stale. So idk how to proceed, really. 🤷 It appears to be a bug in numpy that only manifests on Windows.

Thanks, it looks like it is this issue. I'm saving the safetensors file in WSL, after modifying job_new.json, once the quant has completed in Windows. While this is ten times slower than saving in Windows, at least I don't have to spend another couple of hours re-quanting in WSL.
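(For reference, the resume flow that makes this possible: convert.py picks up job_new.json from the output directory and continues where the job left off, per the "-- Resuming job" line in the log above, so only the remaining compile/save step gets redone. With placeholder paths, something like:

python convert.py -i /mnt/d/command-r-plus -o /mnt/d/cmdr

What exactly needs editing in job_new.json for a Windows-to-WSL move, e.g. the recorded paths, will depend on the setup.)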