hiyouga / LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

DeepSpeed ZeRO-3 Multi-GPU Tokenizing But Not Training #287

Closed · aldrinc closed this 11 months ago

aldrinc commented 11 months ago

I am facing an issue with the configuration below (it was working yesterday and for the past week): the model loads and the dataset is tokenized, but then the script hangs. GPU utilization spikes to 100% on all GPUs, yet training never starts. The only change is that previously I wasn't required to use a template, but now the CLI forces me to add it.

Terminal log output is attached: log_output.txt

pip install --upgrade huggingface_hub
huggingface-cli login --token $HF_TOKEN

setup.sh

git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning.git
cd LLaMA-Efficient-Tuning
pip install -r requirements.txt
pip install "bitsandbytes>=0.39.0"  # quoted so the shell does not treat >= as a redirection
pip install scipy
pip install -U git+https://github.com/huggingface/transformers.git
pip install -U git+https://github.com/huggingface/peft.git
pip install deepspeed

Accelerate Config

This Machine
multi-GPU
1 node
No torch dynamo
Deepspeed yes
No config file
Deepspeed Zero3
no offload optimizer
no offload parameters
4 grad accumulation steps
no grad clipping
no deepspeed.zero.Init
6 GPUs (A100-80GB)
FP16

config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
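
Note that the interactive answers above select ZeRO-3, while this file sets zero_stage: 2. For comparison, a deepspeed_config block matching the ZeRO-3 answers would presumably look like the following sketch (only zero_stage is changed; all other values are taken from the file above):

deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 3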

Run command

accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --do_train \
    --dataset webqa \
    --finetuning_type lora \
    --output_dir /workspace/webqa-test \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 50 \
    --learning_rate 2e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16 \
    --quantization_bit 4 \
    --template llama2

hiyouga commented 11 months ago

Currently, DeepSpeed ZeRO-3 may be incompatible with 4-bit training; consider using fp16 training instead. BTW, we do not recommend using a non-English corpus to fine-tune the LLaMA-2 models.
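
For example, the launch command from the report with 4-bit quantization dropped (same flags, minus --quantization_bit 4, so LoRA runs in plain fp16) would be:

accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --do_train \
    --dataset webqa \
    --finetuning_type lora \
    --output_dir /workspace/webqa-test \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 50 \
    --learning_rate 2e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16 \
    --template llama2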

aldrinc commented 11 months ago

DeepSpeed ZeRO-2 should work though, no? And yeah, I just selected webqa to demonstrate the issue; we're using an English dataset for the actual training.

hiyouga commented 11 months ago

DeepSpeed ZeRO-2 should be compatible with QLoRA.

zhangjunyi111 commented 11 months ago

@aldrinc did you solve the problem? I am running into the same problem.

zhangjunyi111 commented 11 months ago

@hiyouga which parameter do you mean? Are you saying mixed_precision should be set to fp16?

zhangjunyi111 commented 11 months ago

But utilization is 0% on all GPUs.

hiyouga commented 11 months ago

@zhangjunyi111 FP16 is required
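
Concretely, fp16 is enabled in two places in the setup posted above:

# in the accelerate config.yaml
mixed_precision: fp16

# on the train_bash.py command line
--fp16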

aldrinc commented 11 months ago

@zhangjunyi111 I did.

FP16, as @hiyouga said, worked for me.

Share your config if you still can't solve your issue.

zhangjunyi111 commented 11 months ago

I am confused. As I understand it, your command, accelerate launch src/train_bash.py --stage sft --model_name_or_path meta-llama/Llama-2-7b-hf --do_train --dataset webqa --finetuning_type lora --output_dir /workspace/webqa-test --overwrite_cache --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 50 --learning_rate 2e-5 --num_train_epochs 1.0 --plot_loss --fp16 --quantization_bit 4 --template llama2, already has fp16 enabled.

zhangjunyi111 commented 11 months ago

I solved my problem by updating PyTorch from 1.3.1 to 2.0.1.
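
For reference, the upgrade can be done with something like the following (the exact command depends on your CUDA build and package index):

pip install --upgrade "torch==2.0.1"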

zhangjunyi111 commented 11 months ago

Unfortunately, today I ran on two nodes with 13 GPUs and found that one node runs normally but the other hangs. However, when I use 16 GPUs, the run is normal and I can get the model.

qiankunlee commented 3 months ago

Where is config.yaml?