kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

Is deepseek-ai/DeepSeek-V2.5 supported? #79

Closed · AshD closed this issue 2 months ago

AshD commented 2 months ago

I tried running bartowski/DeepSeek-V2.5-GGUF on a Linux box with 512GB RAM/192GB VRAM.

python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2.5 --gguf_path ~/ai/llms/DeepSeek-V2.5-Q80-00001-of-00007.gguf

File "/home/ash/ai/ktransformers/ktransformers/operators/base_operator.py", line 60, in load utils.load_weights(child, self.gguf_loader, self.key+".") File "/home/ash/ai/ktransformers/ktransformers/util/utils.py", line 83, in load_weights load_weights(child, gguf_loader, prefix+name+".") File "/home/ash/ai/ktransformers/ktransformers/util/utils.py", line 81, in load_weights load_cur_state_dict(module, gguf_loader, prefix) File "/home/ash/ai/ktransformers/ktransformers/util/utils.py", line 76, in load_cur_state_dict raise Exception(f"can't find {translated_key} in GGUF file!") Exception: can't find token_embd.weight in GGUF file!

Azure-Tang commented 2 months ago

You should pass a directory rather than a single gguf file to --gguf_path. And make sure your gguf files end with .gguf~
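
Passing a directory works because the loader picks up every shard inside it; a minimal sketch of the idea (illustrative only, not the actual ktransformers code):

from pathlib import Path

def find_gguf_shards(gguf_dir: str) -> list[Path]:
    # Collect every *.gguf shard in the directory; pointing at a single
    # shard file would miss the rest of the split model.
    shards = sorted(Path(gguf_dir).expanduser().glob("*.gguf"))
    if not shards:
        raise FileNotFoundError(f"no .gguf files found in {gguf_dir}")
    return shards

# e.g. find_gguf_shards("~/ai/llms/DeepSeek-V2.5") should list every
# DeepSeek-V2.5-Q80-0000?-of-00007.gguf shard in that folder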

AshD commented 2 months ago

Thanks @Azure-Tang, it works now with: ktransformers --model_path deepseek-ai/DeepSeek-V2-Chat --gguf_path ~/ai/llms/DeepSeek-V2.5 --port 8080 --cpu_infer 64

A couple of issues:

  1. I tried the streaming OpenAI API and it did not work, but the non-streaming one worked fine. I assume the streaming API is not supported.

  2. How do I get multi-GPU to work? I made a copy of ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml and called it DeepSeek-V2.5-multi-gpu-4.yaml, but it loads the model onto only 1 of my 4 GPUs (RTX 6000 Ada).

Thanks, Ash

Azure-Tang commented 2 months ago

  1. About the streaming API, any idea? @UnicornChan
  2. You can specify a multi-GPU yaml via --optimize_config_path when starting the server. The detailed yaml writing tutorial is here.

sammcj commented 2 months ago

I've been trying to get DeepSeek V2.5 working but have hit 'NotImplementedError: ggml_type 21 not implemented' errors:

ls -ltar /mnt/llm/models/deepseek-v25/
total 144G
-rw-r--r-- 1 root apps 37G Sep  9 23:04 DeepSeek-V2.5-IQ2_M-00001-of-00002.gguf
-rw-r--r-- 1 root apps 35G Sep  9 23:05 DeepSeek-V2.5-IQ2_M-00002-of-00002.gguf
-rw-r--r-- 1 root apps 72G Sep 10 08:56 DeepSeek-V2.5-IQ2_M.gguf

ktransformers --model_path deepseek-ai/DeepSeek-V2.5 --gguf_path /mnt/llm/models/deepseek-v25/ --web True --optimize_config_path ./configs/DeepSeek-V2-Chat-multi-gpu-more-vram.yaml

...
Injecting model.layers.59.mlp.shared_experts.act_fn as default
Injecting model.layers.59.input_layernorm as default
Injecting model.layers.59.post_attention_layernorm as default
Injecting model.norm as default
Injecting lm_head as default
loading token_embd.weight to cuda:0
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.12.4/bin/ktransformers", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/llm/ktransformers/git/ktransformers/server/main.py", line 134, in main
    create_interface(config=cfg, default_args=default_args)
  File "/mnt/llm/ktransformers/git/ktransformers/server/utils/create_interface.py", line 27, in create_interface
    GlobalInterface.interface = BackendInterface(default_args)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/llm/ktransformers/git/ktransformers/server/backend/interfaces/ktransformers.py", line 40, in __init__
    optimize_and_load_gguf(self.model, optimize_rule_path, gguf_path, config)
  File "/mnt/llm/ktransformers/git/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 85, in load_weights
    module.load()
  File "/mnt/llm/ktransformers/git/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 81, in load_weights
    load_cur_state_dict(module, gguf_loader, prefix)
  File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 71, in load_cur_state_dict
    weights = gguf_loader.load_gguf_tensor(translated_key, device = device).to(dtype = target_dtype)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/llm/ktransformers/git/ktransformers/util/custom_gguf.py", line 286, in load_gguf_tensor
    raise NotImplementedError(f"ggml_type {ggml_type} not implemented")
NotImplementedError: ggml_type 21 not implemented

Not sure if it's a lack of DeepSeek V2 support, or perhaps my config?

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([0-1][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([2-3][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.([4-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"

- match:
    name: "^model\\.layers\\.([0-1][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.([4-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map:
        20: "cuda:1"
        40: "cuda:2"

- match:
    name: "^model\\.layers\\.([0-1][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([2-3][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "(^model\\.layers\\.([4-5][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"

Edit: nope, not my config; I tried with the default as well.

sammcj commented 2 months ago

Oh, I just spotted:

we only support q4_k_m and q8_0 for now, more formats are coming soon

That will be why.
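
For anyone else hitting this, a quick way to check which quant types a GGUF actually contains before loading it (a sketch using the gguf-py package, not part of ktransformers; if I read llama.cpp's enum right, ggml_type 21 is IQ3_S, which IQ2_M files use for some tensors):

from collections import Counter
from gguf import GGUFReader  # pip install gguf

# Count the tensor quantization types in the file; anything outside
# Q4_K / Q8_0 (plus the usual F32 norm tensors) won't load in
# ktransformers yet.
reader = GGUFReader("/mnt/llm/models/deepseek-v25/DeepSeek-V2.5-IQ2_M.gguf")
print(Counter(t.tensor_type.name for t in reader.tensors))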

UnicornChan commented 2 months ago

Could you please show the code you use with streaming? This is the code I use for testing; the version of the OpenAI client is 1.44.1. @AshD

from openai import OpenAI

client = OpenAI(api_key="anywords", base_url="http://localhost:10002/v1")

model = "qwen72b" 
response = client.chat.completions.create(
    model=model,
    messages=[
        {'role': 'user', 'content': "talk a 200 words story"}
    ],
    stream=True
)

for event in response:
    print(event.choices[0].delta.content)
AshD commented 2 months ago

Thanks @Azure-Tang, it's working now with --optimize_config_path. BTW, on the home page, the model arguments section calls this parameter --optimize_rule_path.

AshD commented 2 months ago

I tested this with our Fusion Quill client, which uses a custom C# implementation. I will test some other clients.

AshD commented 2 months ago

Thanks @UnicornChan. There was an issue where our streaming client was expecting "data: " (with a space) and ktransformers was sending "data:" without the space. I handled that condition on our side, so I am closing this issue. Thanks for all your help.
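
For reference, the tolerant handling looks roughly like this (a Python sketch of the idea; the actual client is C#):

def sse_data_payload(line: str) -> str | None:
    # Accept both "data: {...}" and "data:{...}"; the SSE spec treats the
    # space after the colon as optional.
    if not line.startswith("data:"):
        return None  # ignore comments, event:/id: fields, keep-alives
    payload = line[len("data:"):].lstrip()
    return None if payload == "[DONE]" else payload

# Both forms yield the same JSON payload:
assert sse_data_payload('data: {"x": 1}') == sse_data_payload('data:{"x": 1}')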