You should pass a directory rather than a single gguf file to --gguf_path, and make sure your gguf files end with .gguf.
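For example, a quick check that the directory ktransformers will scan actually contains the split .gguf parts (a small sketch; the path matches the command in the reply below and may differ on your setup):

from pathlib import Path

# Directory passed to --gguf_path; every split part should end with .gguf
gguf_dir = Path.home() / "ai/llms/DeepSeek-V2.5"
parts = sorted(gguf_dir.glob("*.gguf"))
print(f"{len(parts)} .gguf file(s) found in {gguf_dir}")
for p in parts:
    print(f"  {p.name}  {p.stat().st_size / 2**30:.1f} GiB")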
Thanks @Azure-Tang. It works now with: ktransformers --model_path deepseek-ai/DeepSeek-V2-Chat --gguf_path ~/ai/llms/DeepSeek-V2.5 --port 8080 --cpu_infer 64
A couple of issues:
- I tried the streaming OpenAI API and it did not work, but the non-streaming one worked fine. I assume the streaming API is not supported.
- How do I get multi-GPU to work? I made a copy of ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml and called it DeepSeek-V2.5-multi-gpu-4.yaml, but it loads the model onto only 1 of my 4 GPUs (RTX 6000 Ada).
Thanks, Ash
I've been trying to get DeepSeek V2.5 working but have hit 'NotImplementedError: ggml_type 21 not implemented' errors:
ls -ltar /mnt/llm/models/deepseek-v25/
total 144G
-rw-r--r-- 1 root apps 37G Sep 9 23:04 DeepSeek-V2.5-IQ2_M-00001-of-00002.gguf
-rw-r--r-- 1 root apps 35G Sep 9 23:05 DeepSeek-V2.5-IQ2_M-00002-of-00002.gguf
-rw-r--r-- 1 root apps 72G Sep 10 08:56 DeepSeek-V2.5-IQ2_M.gguf
ktransformers --model_path deepseek-ai/DeepSeek-V2.5 --gguf_path /mnt/llm/models/deepseek-v25/ --web True --optimize_config_path ./configs/DeepSeek-V2-Chat-multi-gpu-more-vram.yaml
...
Injecting model.layers.59.mlp.shared_experts.act_fn as default
Injecting model.layers.59.input_layernorm as default
Injecting model.layers.59.post_attention_layernorm as default
Injecting model.norm as default
Injecting lm_head as default
loading token_embd.weight to cuda:0
Traceback (most recent call last):
File "/root/.pyenv/versions/3.12.4/bin/ktransformers", line 8, in <module>
sys.exit(main())
^^^^^^
File "/mnt/llm/ktransformers/git/ktransformers/server/main.py", line 134, in main
create_interface(config=cfg, default_args=default_args)
File "/mnt/llm/ktransformers/git/ktransformers/server/utils/create_interface.py", line 27, in create_interface
GlobalInterface.interface = BackendInterface(default_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/llm/ktransformers/git/ktransformers/server/backend/interfaces/ktransformers.py", line 40, in __init__
optimize_and_load_gguf(self.model, optimize_rule_path, gguf_path, config)
File "/mnt/llm/ktransformers/git/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
load_weights(module, gguf_loader)
File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 85, in load_weights
module.load()
File "/mnt/llm/ktransformers/git/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 81, in load_weights
load_cur_state_dict(module, gguf_loader, prefix)
File "/mnt/llm/ktransformers/git/ktransformers/util/utils.py", line 71, in load_cur_state_dict
weights = gguf_loader.load_gguf_tensor(translated_key, device = device).to(dtype = target_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/llm/ktransformers/git/ktransformers/util/custom_gguf.py", line 286, in load_gguf_tensor
raise NotImplementedError(f"ggml_type {ggml_type} not implemented")
NotImplementedError: ggml_type 21 not implemented
Not sure if it's a lack of DeepSeek V2 support, or perhaps my config?
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([0-1][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([2-3][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.([4-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model\\.layers\\.([0-1][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.([4-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map:
        20: "cuda:1"
        40: "cuda:2"
- match:
    name: "^model\\.layers\\.([0-1][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([2-3][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "(^model\\.layers\\.([4-5][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
*edit: nope, not my config; I tried with the default rules as well.
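Regardless of whether the config was the culprit, the layer-range regexes in rules like these are easy to sanity-check on their own. Note that [0-1][0-9] only matches two-digit layer indices, so single-digit layers (0-9) fall through to whichever later rule matches (a standalone sketch, nothing ktransformers-specific):

import re

pattern = r"^model\.layers\.([0-1][0-9])\."  # intended to cover layers 0-19
for name in ["model.layers.5.mlp", "model.layers.15.mlp", "model.layers.25.mlp"]:
    print(name, "->", bool(re.match(pattern, name)))
# model.layers.5.mlp  -> False  (single digit, not matched by [0-1][0-9])
# model.layers.15.mlp -> True
# model.layers.25.mlp -> False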
Oh, I just spotted:
we only support q4_k_m and q8_0 for now, more formats are coming soon
That will be why.
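For anyone hitting the same error: ggml_type 21 corresponds to IQ3_S in the GGML type enum, and an IQ2_M file mixes several i-quant tensor types, none of which were supported at that point. A quick way to see which quantization types a GGUF actually contains (a sketch using the gguf Python package that ships with llama.cpp, pip install gguf; the path is from the listing above):

from collections import Counter
from gguf import GGUFReader

# Count how many tensors use each quantization type in the file
reader = GGUFReader("/mnt/llm/models/deepseek-v25/DeepSeek-V2.5-IQ2_M.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name:10s} {n} tensors")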
Could you please show the code you use for streaming? This is the code I use for testing; the OpenAI client version is 1.44.1. @AshD
from openai import OpenAI
client = OpenAI(api_key="anywords", base_url="http://localhost:10002/v1")
model = "qwen72b"
response = client.chat.completions.create(
model=model,
messages=[
{'role': 'user', 'content': "talk a 200 words story"}
],
stream=True
)
for event in response:
print(event.choices[0].delta.content)
- About the streaming API, any ideas? @UnicornChan
- You can specify a multi-GPU YAML via --optimize_config_path in the server. The detailed YAML writing tutorial is here.
Thanks @Azure-Tang. It's working now with --optimize_config_path. BTW, on the home page, the model arguments section calls this parameter --optimize_rule_path.
I tested the streaming API with our Fusion Quill client, which uses a custom C# implementation. Will test some other client.
Thanks @UnicornChan
There was an issue where our streaming client was expecting "data:
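For anyone debugging the same thing, the raw server-sent-events framing is easy to inspect directly; each event should arrive as a line starting with "data: " followed by a JSON chunk, and the stream ends with "data: [DONE]" (a sketch using requests; the port and model name are assumptions based on the commands above):

import json
import requests

payload = {
    "model": "deepseek-ai/DeepSeek-V2-Chat",
    "messages": [{"role": "user", "content": "tell a 200 word story"}],
    "stream": True,
}
with requests.post("http://localhost:8080/v1/chat/completions", json=payload, stream=True) as r:
    for raw in r.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        print(repr(line))  # shows exactly how each event is framed, including any "data: " prefix
        if line.startswith("data: ") and line[6:].strip() != "[DONE]":
            chunk = json.loads(line[6:])
            delta = chunk["choices"][0]["delta"].get("content")
            if delta:
                print(delta, flush=True)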
I tried running bartowski/DeepSeek-V2.5-GGUF on a Linux box with 512GB RAM/192GB VRAM.
python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2.5 --gguf_path ~/ai/llms/DeepSeek-V2.5-Q8_0-00001-of-00007.gguf
File "/home/ash/ai/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/home/ash/ai/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/home/ash/ai/ktransformers/ktransformers/util/utils.py", line 81, in load_weights
load_cur_state_dict(module, gguf_loader, prefix)
File "/home/ash/ai/ktransformers/ktransformers/util/utils.py", line 76, in load_cur_state_dict
raise Exception(f"can't find {translated_key} in GGUF file!")
Exception: can't find token_embd.weight in GGUF file!