Open cyrishe opened 1 year ago
The only change in finetune.py is:
```python
with torch.autocast("cuda"):
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```

Without autocast, the training process stops because of a precision-format (dtype mismatch) error.
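For context, here is a minimal sketch of where that wrap sits in finetune.py. The names `trainer` and `resume_from_checkpoint` are the ones the original script already uses; the explicit `dtype` argument is my addition and should match how the base model was loaded (e.g. `torch.float16`, or `torch.bfloat16` on hardware that supports it):

```python
import torch

# Run the whole training loop under an autocast context so eligible CUDA ops
# are executed in one reduced-precision dtype instead of mixing fp16 and fp32
# tensors (which is what triggers the precision-format error without it).
autocast_dtype = torch.float16  # assumption: or torch.bfloat16

with torch.autocast(device_type="cuda", dtype=autocast_dtype):
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```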
I have the same problem; any progress with DDP?
Hi, I got errors when finetuning with LoRA, using the code from 'alpaca-lora' as-is. On a single GPU it runs well, but problems occur with multiple GPUs:

1. Without DDP, the training task runs, but the GPUs seem to work serially: one or two are busy while the rest stay idle.
2. With DDP on 4 GPUs, it reports the following error (see the sketch after the launch script for one way to pass the suggested flag through the HF Trainer):

```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
This error indicates that your module has parameters that were not used in producing loss. You can
enable unused parameter detection by passing the keyword argument find_unused_parameters=True to
torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs
participate in calculating loss.
```

Note that I am using an ARM64 server; I don't know if this matters. The code is from the repo and I changed nothing. Below is the script used to run the training task; finetune.py is almost unchanged:

```bash
WORLD_SIZE=1 TORCH_DISTRIBUTED_DEBUG=INFO torchrun --nproc_per_node=4 finetune.py \
    --base_model '../../newtest/model_hub/llama_hf/' \
    --data_path 'generate_data/sft_test.json' \
    --output_dir './lora-alpaca-code-gen' \
    --batch_size 128 \
    --micro_batch_size 4 \
    --num_epochs 3 \
    --learning_rate 1e-4 \
    --cutoff_len 512 \
    --val_set_size 2000 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --lora_target_modules '[q_proj,v_proj]' \
    --train_on_inputs \
    --group_by_length
```