kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Add support to switch main GPU #21

Closed. firmanmm closed this issue 3 months ago.

firmanmm commented 3 months ago

Hi, I wanted to ask if there is a way to switch to the second GPU, because currently it always uses the main GPU. Unfortunately for me, my main GPU has 16GB of VRAM while my second GPU has 24GB. I already tried replacing every mention of cuda:0 with cuda:1, but it still loads on GPU 0, which triggers torch.OutOfMemoryError: CUDA out of memory. Thank you for this awesome project.
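
For reference, a quick way to confirm which physical index the 24GB card has is a plain PyTorch check (just a sketch, nothing ktransformers-specific):

import torch

# List every visible GPU with its total memory, so you know which index is the 24GB card.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 2**30:.1f} GiB")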

firmanmm commented 3 months ago

I managed to get it running by adding CUDA_VISIBLE_DEVICES=1 before the command, but somehow I didn't get the advertised speed. I only got this result:

CPU:   7950X3D
GPU:   4080 Super, 3090
RAM:   192GB, speed limited to 3600 MT/s
Model: DeepSeek-Coder-V2-Instruct

prompt eval count:    14 token(s)
prompt eval duration: 1.2786610126495361s
prompt eval rate:     10.948953523647642 tokens/s
eval count:           160 token(s)
eval duration:        24.879518270492554s
eval rate:            6.430992684844793 tokens/s

However, I did notice that my RAM is not getting filled: [htop screenshot]

Azure-Tang commented 3 months ago

Hi, for now you can try export CUDA_VISIBLE_DEVICES=1 (your second GPU's index) in your shell before starting the program, and leave everything else unchanged.
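
For illustration, here is a minimal Python sketch of the same idea, assuming the 24GB card is physical index 1; the variable has to be set before CUDA is initialized, i.e. before the first import of torch in the process:

import os

# Hide every GPU except physical index 1; it will then be exposed as cuda:0,
# so anything hard-coded to cuda:0 keeps working unchanged.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())      # 1 -- only the second (24GB) card is visible
print(torch.cuda.get_device_name(0))  # it now shows up as cuda:0

Setting the variable in the shell before launching, as in the comment above, has the same effect.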

Azure-Tang commented 3 months ago

The RAM usage shown in your screenshot seems too low, which may cause extra time to load weights during inference. However, this shouldn't happen, because you have enough RAM for the weights to be loaded into it during the warm-up phase. Can you provide your start command and optimize YAML?

Any idea? @chenht2022

chenht2022 commented 3 months ago

In htop, green represents Kernel buffers, and orange represents Memory mapped. The weights of experts in KTransformers are mmap-ed, so this portion of memory usage usually appears as orange. In your screenshot, it accounts for at least 120GB, which is as expected.
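
As a rough illustration of why mmap-ed weights look like this (a generic NumPy sketch with a hypothetical file name, not ktransformers code): pages of a memory-mapped file are faulted in on demand and counted as file cache in the system memory bar rather than as ordinary used memory, so RAM does not look "filled" even though the data is resident.

import numpy as np

# Hypothetical stand-in for a GGUF expert-weight shard (~128 MiB of float16 zeros).
path = "fake_expert_weights.npy"
np.save(path, np.zeros((64, 1024, 1024), dtype=np.float16))

weights = np.load(path, mmap_mode="r")  # nothing is read into RAM yet
block = np.asarray(weights[3])          # only the pages actually touched are faulted in
print(block.shape, block.dtype)         # (1024, 1024) float16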

Perhaps you can improve performance by increasing parallelism. Try using --cpu_infer 24.

firmanmm commented 3 months ago

Hi @chenht2022, thanks for the suggestion, but setting it to 24 seems to reduce the eval speed a bit:

prompt eval count:    14 token(s)
prompt eval duration: 1.4017484188079834s
prompt eval rate:     9.987526871551813 tokens/s
eval count:           160 token(s)
eval duration:        26.098639011383057s
eval rate:            6.130587879705726 tokens/s

For extra context, I'm running it from the main branch, with the command below:

CUDA_VISIBLE_DEVICES=1 python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-Coder-V2-Instruct --gguf_path ../newmodel --cpu_infer 24

chenht2022 commented 3 months ago

Since the MoE layer is offloaded to the CPU, its performance bottleneck during the generation phase is the memory bandwidth available for reading weights. The comparison with llama.cpp shown in the video was conducted on a machine with an Intel 4th-generation Xeon processor equipped with 8 memory channels, each running at 4800 MT/s. Achieving this level of performance places stringent requirements on the hardware. As a reference, we also ran tests on another setup with an Intel 14900KF, a 4090D GPU, and 192GB of memory on 2 memory channels at 4000 MT/s. The prefill performance for ktransformers and llama.cpp was 65.6 tokens/s and 8.93 tokens/s, respectively, and the generation performance was 6.07 tokens/s and 1.23 tokens/s, respectively.
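
For a rough back-of-envelope comparison of these setups (theoretical peaks only, assuming 8 bytes per transfer per channel; peak_bw_gb_s is just a helper for this sketch):

# Theoretical peak DRAM bandwidth: channels * transfer rate (MT/s) * 8 bytes per transfer.
def peak_bw_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # GB/s

print(peak_bw_gb_s(8, 4800))  # ~307 GB/s -- 8-channel 4th-gen Xeon from the video
print(peak_bw_gb_s(2, 4000))  # ~64 GB/s  -- 14900KF reference machine
print(peak_bw_gb_s(2, 3600))  # ~58 GB/s  -- a dual-channel desktop at 3600 MT/s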

firmanmm commented 3 months ago

Ah, I see. I thought the video shared in the README was recorded on a standard consumer desktop machine, which rarely has more than 2 memory channels. Anyway, thank you for the explanation.