gawain000000 opened this issue 4 months ago
@gawain000000, can you clarify your goals? There are two different solutions depending on whether you are optimizing for latency or for throughput (and low budget). I also noticed that your code uses deepspeed.init_inference together with ZeRO stage 3, which is not a recommended combination.
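For context, a minimal sketch of the two separate paths referred to above, with a placeholder model name and config values (an illustration, not the configuration from this issue):

```python
# Sketch only: contrasts the two DeepSpeed inference paths. Placeholder
# model name and config values; pick ONE path, do not combine
# deepspeed.init_inference with a ZeRO stage-3 config.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

USE_TENSOR_PARALLEL = True  # True: latency path, False: low-budget ZeRO-Inference path

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)

if USE_TENSOR_PARALLEL:
    # Latency-oriented path: shard the weights across GPUs with kernel injection.
    engine = deepspeed.init_inference(
        model,
        tensor_parallel={"tp_size": 2},
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
else:
    # Throughput / low-budget path (ZeRO-Inference): ZeRO stage 3, optionally
    # offloading parameters to CPU, initialized via deepspeed.initialize.
    ds_config = {
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
        "train_micro_batch_size_per_gpu": 1,
    }
    engine = deepspeed.initialize(model=model, config=ds_config)[0]
    engine.module.eval()
```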
@tjruwase My goals are the following:
The reason for this is that when an LLM performs inference over a long document, it needs additional GPU memory to store the KV cache. I currently deploy the LLM on L40S GPUs, each with only 46GB of memory. Without model slicing, processing a document of around 7000 tokens results in an OOM error. What I do not understand is why DeepSpeed's inference initialization allocates the same amount of memory per GPU whether I deploy on 2 GPUs or on 4 GPUs, which makes it impossible to process long documents.
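For a rough sense of the numbers involved: the KV cache grows linearly with sequence length, and under tensor parallelism the attention heads, and with them the cache, should be split across GPUs. A back-of-the-envelope sketch with placeholder model dimensions (the issue does not state which model or precision is used):

```python
# Rough KV-cache size estimate; the dimensions below are hypothetical
# placeholders, not taken from this issue.
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    # Factor of 2 for keys and values; fp16 gives 2 bytes per element.
    return 2 * batch * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

total = kv_cache_bytes(seq_len=7000, num_layers=80, num_kv_heads=64, head_dim=128)
print(f"whole KV cache:     {total / 2**30:.1f} GiB")   # ~17.1 GiB with these numbers
print(f"per GPU with TP=2:  {total / 2 / 2**30:.1f} GiB")
print(f"per GPU with TP=4:  {total / 4 / 2**30:.1f} GiB")
```

If the per-GPU allocation does not shrink when moving from 2 to 4 GPUs, that suggests the weights or the cache are being replicated rather than sharded.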
Why does the CPU virtual memory usage increase significantly when using more GPUs?
Has the problem of CPU memory increasing as the number of GPUs grows been resolved?
I am experiencing excessive CPU and GPU memory usage when running multi-GPU inference with DeepSpeed. Specifically, the memory usage does not scale as expected when increasing the number of GPUs. Below is the code I am using for inference:
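A minimal sketch of this kind of setup, assuming a Hugging Face causal LM sharded via deepspeed.init_inference (the model name, prompt, and generation settings are placeholders rather than the original script):

```python
# Minimal sketch of a tensor-parallel DeepSpeed inference script
# (placeholder model and prompt; not the original code from this issue).
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"           # placeholder model
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across all GPUs visible to the launcher.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

prompt = "Summarize the following document: ..."    # placeholder long input
inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=256)
if local_rank == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```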
Steps to Reproduce:
1. Run the script with 2 GPUs.
2. Run the script with 4 GPUs.
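With the standard DeepSpeed launcher, the two runs would look roughly like `deepspeed --num_gpus 2 inference.py` and `deepspeed --num_gpus 4 inference.py` (the script name is a placeholder; the original launch commands are not included in the thread).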
Expected Behavior: I expected that using 4 GPUs would reduce the per-GPU memory usage, ideally halving it compared to running with 2 GPUs.
Actual Behavior:
Questions:
System Info:
Additional Context: Any insights or suggestions on how to optimize the memory usage for multi-GPU inference with DeepSpeed would be greatly appreciated. Thank you!