microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

CPU OOM when inferencing Llama3-70B-Chinese-Chat #904

Open · GORGEOUSLCX opened this issue 1 month ago

GORGEOUSLCX commented 1 month ago

Code: the text-generation demo (`inference-test.py`)
Command: `deepspeed --num_gpus 2 inference-test.py --dtype float16 --batch_size 4 --max_new_tokens 200 --model ../Llama3-70B-Chinese-Chat`
Hardware: two A100 80GB GPUs, 250 GB of CPU RAM
Problem: When DeepSpeed loads the float16 model it consumes too much CPU memory, and 250 GB is not enough to load the 70B model. Loading the same checkpoint directly with Transformers, `model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")`, runs inference without exhausting CPU memory. How can I reduce the CPU memory usage when loading through DeepSpeed?
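For reference, a minimal sketch of the meta-tensor loading path that DeepSpeed's inference tutorial describes for this situation: the model skeleton is built on the `meta` device (so weights are never materialized in CPU RAM) and `deepspeed.init_inference()` streams the checkpoint shards straight to the GPUs. This is not the exact `inference-test.py` script; the paths, the `checkpoints_*.json` filename, and the JSON schema (`"type"`/`"checkpoints"`/`"version"`) are assumptions, and whether this path works for Llama without kernel injection depends on your DeepSpeed version.

```python
import json
import os
from pathlib import Path

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed local path from the report above.
model_path = "../Llama3-70B-Chinese-Chat"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 1) Build the model skeleton on the "meta" device: no weights are
#    materialized in CPU RAM at this point.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# 2) Point DeepSpeed at the raw weight shards so init_inference() can load
#    them directly onto the GPUs. The schema below is an assumption based on
#    the DeepSpeed inference tutorial; check the format your version expects.
checkpoints_json = f"checkpoints_{local_rank}.json"
shards = sorted(str(p) for p in Path(model_path).glob("*.safetensors"))
with open(checkpoints_json, "w") as f:
    json.dump({"type": "ds_model", "checkpoints": shards, "version": 1.0}, f)

# 3) Shard the model across the GPUs (tensor parallelism) while streaming the
#    checkpoint, instead of replicating the full fp16 model in host memory.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float16,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=False,
)

prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
outputs = engine.module.generate(**inputs, max_new_tokens=200)
if local_rank == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The script would be launched the same way as the original command (`deepspeed --num_gpus 2 script.py`). The example script in this repo may already expose an option for meta-tensor loading; its `--help` output is worth checking before writing a custom script.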