huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Load llama-2-70b model need too much CPU memory #32051

Closed JuiceLemonLemon closed 6 days ago

JuiceLemonLemon commented 1 month ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

  1. Download Alpaca code. https://github.com/tatsu-lab/stanford_alpaca

  2. Run the command below to load the Llama-2-70b model:

         torchrun --nproc_per_node=8 --master_port=29505 train.py \
             --model_name_or_path ../models/Llama-2-70b-hf/ \
             --data_path ./alpaca_data.json \
             --bf16 True \
             --output_dir ./output \
             --num_train_epochs 1 \
             --per_device_train_batch_size 1 \
             --per_device_eval_batch_size 1 \
             --gradient_accumulation_steps 1 \
             --evaluation_strategy "no" \
             --save_strategy "steps" \
             --save_steps 2000 \
             --save_total_limit 1 \
             --learning_rate 1e-5 \
             --weight_decay 0. \
             --warmup_ratio 0.03 \
             --lr_scheduler_type "cosine" \
             --logging_steps 1 \
             --fsdp "full_shard auto_wrap" \
             --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
             --tf32 True \
             --report_to none

  3. The model fails to load: CPU memory usage exceeds 1 TB, and the server hangs.

Expected behavior

Loading the Llama-2-70b model on 8 GPUs consumes too much CPU memory. How can this issue be fixed?
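A plausible explanation (an assumption on my part, not confirmed in this thread) is that with `torchrun --nproc_per_node=8`, each of the 8 ranks materializes a full copy of the model in CPU RAM before FSDP shards it. A quick back-of-the-envelope check:

```python
# Rough estimate of peak CPU memory if every torchrun rank loads a full
# copy of Llama-2-70b before FSDP sharding (assumed failure mode, not
# confirmed in this thread).
NUM_RANKS = 8          # --nproc_per_node=8
NUM_PARAMS = 70e9      # Llama-2-70b parameter count (approximate)
BYTES_PER_PARAM = 2    # bf16 weights (--bf16 True)

total_bytes = NUM_RANKS * NUM_PARAMS * BYTES_PER_PARAM
total_tib = total_bytes / 2**40
print(f"Estimated peak CPU memory: {total_tib:.2f} TiB")
# prints "Estimated peak CPU memory: 1.02 TiB"
```

This lines up with the observed >1 TB usage; fixes for this class of problem typically involve loading the weights on only one rank and broadcasting them (e.g. via FSDP's `sync_module_states` with meta-device initialization on the other ranks), though that is a general suggestion rather than something verified against this exact setup.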

amyeroberts commented 1 month ago

Hi @JuiceLemonLemon, thanks for opening this issue!

Without knowing the GPUs you're running on, it's hard to say what's reasonable in terms of CPU offloading. Have you inspected the memory usage with tools like nvidia-smi and top to ensure the model is loading as expected?
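For the CPU side of that check, one option is to poll /proc/meminfo while the model loads. A minimal sketch, assuming a Linux host (the helper name is mine, not part of any library):

```python
def available_cpu_mem_gib():
    """Return MemAvailable from /proc/meminfo in GiB (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kib = int(line.split()[1])  # value is reported in kiB
                return kib / 2**20
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

print(f"Available CPU memory: {available_cpu_mem_gib():.1f} GiB")
```

Calling this periodically (or just watching `top`/`free -h`) during model loading shows whether available memory drops roughly once per rank, which would point at each process loading its own full copy.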

As the command comes from the https://github.com/tatsu-lab/stanford_alpaca repo, I'd suggest opening an issue there; its maintainers will have more knowledge of and experience with the expected behaviour and possible gotchas.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.