Closed: firmanmm closed this issue 3 months ago
I managed to get it running by adding `CUDA_VISIBLE_DEVICES=1` before the command, and it starts, but somehow I don't get the advertised speed. I only managed to get this result:
CPU: 7950X3D
GPU: 4080 Super, 3090
RAM: 192GB, speed limited to 3600
Model: DeepSeek-Coder-V2-Instruct
prompt eval count: 14 token(s)
prompt eval duration: 1.2786610126495361s
prompt eval rate: 10.948953523647642 tokens/s
eval count: 160 token(s)
eval duration: 24.879518270492554s
eval rate: 6.430992684844793 tokens/s
However, I did notice that my RAM is not getting filled:
Hi, for now you can try `export CUDA_VISIBLE_DEVICES=1` (your second GPU index) in your shell before you start the program, and leave everything else unchanged.
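A minimal sketch of the same idea in Python (assuming PyTorch is installed): masking devices with `CUDA_VISIBLE_DEVICES` has to happen before CUDA is initialized, and the remaining GPU is then re-indexed as `cuda:0` inside the process, so no `cuda:0`/`cuda:1` edits in the code should be needed.

```python
import os

# Expose only physical GPU 1 (assumed here to be the 24GB card); this must be
# set before torch initializes CUDA, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# The single visible GPU is re-indexed as cuda:0 inside this process,
# so code that hard-codes "cuda:0" now runs on the physical second GPU.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # expected: the 24GB card's name
```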
Your RAM usage seems too low, which may cause extra time spent loading weights during inference. However, this shouldn't happen, because you have enough RAM for the weights to be loaded into it during the warm-up phase. Can you provide your start command and optimize YAML?
Any idea? @chenht2022
In `htop`, green represents kernel buffers and orange represents memory-mapped pages. The expert weights in KTransformers are mmap-ed, so this portion of memory usage usually appears as orange. In your screenshot it accounts for at least 120GB, which is as expected.
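As a rough illustration of that behaviour (not the KTransformers loader itself, and the file path below is hypothetical), a memory-mapped file read in Python shows up in htop's memory-mapped portion of the bar rather than as ordinary anonymous process memory:

```python
import numpy as np

# Hypothetical weight file; KTransformers maps the expert weights from the
# GGUF file in a similar spirit, but this is only an illustration.
weights = np.memmap("../newmodel/experts.bin", dtype=np.uint8, mode="r")

# Reading the data faults the file-backed pages into the page cache; htop
# counts them in the memory-mapped (orange) part of the memory bar.
checksum = int(weights[::4096].sum())
print(checksum)
```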
Perhaps you can improve performance by increasing parallelism. Try using `--cpu_infer 24`.
Hi @chenht2022, thanks for the suggestion, but setting it to 24 seems to reduce the eval speed a bit:
prompt eval count: 14 token(s)
prompt eval duration: 1.4017484188079834s
prompt eval rate: 9.987526871551813 tokens/s
eval count: 160 token(s)
eval duration: 26.098639011383057s
eval rate: 6.130587879705726 tokens/s
For extra context, I'm running it from the main branch, with the command below:
CUDA_VISIBLE_DEVICES=1 python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-Coder-V2-Instruct --gguf_path ../newmodel --cpu_infer 24
Since the MoE layer is offloaded to the CPU, its performance bottleneck during the generation phase is the memory bandwidth for reading weights. The comparison with llama.cpp shown in the video was conducted on a machine with an Intel 4th generation Xeon processor equipped with 8 memory channels, each running at 4800 MT/s. Achieving this level of performance places stringent requirements on the hardware. As a reference, we also ran tests on another setup with an Intel 14900KF, a 4090D GPU, and 192GB of memory on 2 memory channels, each running at 4000 MT/s. The prefill performance for KTransformers and llama.cpp was 65.6 tokens/s and 8.93 tokens/s, respectively, and the generate performance was 6.07 tokens/s and 1.23 tokens/s, respectively.
P.S. You can use the following formula to calculate the equivalent memory bandwidth in the generate phase. This value will be lower than the theoretical upper limit of the hardware due to scheduling overhead and waiting for some GPU computations.
Bandwidth = num_hidden_layers * hidden_size * moe_intermediate_size * num_experts_per_tok * (bytes_per_elem_up + bytes_per_elem_gate + bytes_per_elem_down) * tokens_per_second / 10^9
= 60 * 5120 * 1536 * 6 * (0.562500 + 0.562500 + 0.820312) * 6.43 / 10^9
= 35.4 GB/s
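The same calculation as a small script, using the DeepSeek-Coder-V2 constants and the measured 6.43 tokens/s quoted above:

```python
# Equivalent memory bandwidth in the generate phase, per the formula above.
num_hidden_layers = 60
hidden_size = 5120
moe_intermediate_size = 1536
num_experts_per_tok = 6
bytes_per_elem = 0.562500 + 0.562500 + 0.820312  # up + gate + down projections
tokens_per_second = 6.43

bandwidth_gb = (num_hidden_layers * hidden_size * moe_intermediate_size
                * num_experts_per_tok * bytes_per_elem * tokens_per_second) / 1e9
print(f"~{bandwidth_gb:.1f} GB/s")  # ~35.4 GB/s
```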
Due to the sparsity of DeepSeek's MoE layer, prefill transitions from being memory-bandwidth-bound to computation-bound when the prompt is long enough, so there is a significant speedup compared to the generation phase.
Ah, I see. I had thought the video shared in the README showed a standard consumer desktop machine, which rarely has more than 2 memory channels. Anyway, thank you for the explanations.
Hi, I wanted to ask if there is a way to switch to the second GPU, because currently it keeps using the main GPU; unfortunately for me, my main GPU has 16GB while my second GPU has 24GB. I already tried replacing every mention of `cuda:0` with `cuda:1`, but it still loads on GPU 0, which triggers `torch.OutOfMemoryError: CUDA out of memory`. Thank you for this awesome project.