Open cyrishe opened 1 year ago
The only change in finetune.py is:
```python
with torch.autocast("cuda"):
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```

Without autocast, the training process stops because of a precision-format (dtype mismatch) error.
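For context, here is a minimal sketch of where that wrap sits in finetune.py. The names `trainer` and `resume_from_checkpoint` are the ones the original script already uses; the explicit `dtype` argument is my addition and should match how the base model was loaded (e.g. `torch.float16`, or `torch.bfloat16` on hardware that supports it):

```python
import torch

# Run the whole training loop under an autocast context so eligible CUDA ops
# are executed in one reduced-precision dtype instead of mixing fp16 and fp32
# tensors (which is what triggers the precision-format error without it).
autocast_dtype = torch.float16  # assumption: or torch.bfloat16

with torch.autocast(device_type="cuda", dtype=autocast_dtype):
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```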
I have the same problem; any progress with DDP?
Hi, I got errors when finetuning with LoRA, using the code from 'alpaca-lora' as-is. On a single GPU it runs well, but problems occur with multiple GPUs:

1. Without DDP, the training task runs, but the GPUs seem to work serially: one or two are busy while the rest stay idle.
2. With DDP on 4 GPUs, it reports the following error (see the sketch after the launch script for one way to pass the suggested flag through the HF Trainer):

```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
This error indicates that your module has parameters that were not used in producing loss. You can
enable unused parameter detection by passing the keyword argument find_unused_parameters=True to
torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs
participate in calculating loss.
```

Note that I am using an ARM64 server; I don't know if this matters. The code is from the repo and I changed nothing. Below is the script used to run the training task; finetune.py is almost unchanged:

```bash
WORLD_SIZE=1 TORCH_DISTRIBUTED_DEBUG=INFO torchrun --nproc_per_node=4 finetune.py \
    --base_model '../../newtest/model_hub/llama_hf/' \
    --data_path 'generate_data/sft_test.json' \
    --output_dir './lora-alpaca-code-gen' \
    --batch_size 128 \
    --micro_batch_size 4 \
    --num_epochs 3 \
    --learning_rate 1e-4 \
    --cutoff_len 512 \
    --val_set_size 2000 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --lora_target_modules '[q_proj,v_proj]' \
    --train_on_inputs \
    --group_by_length
```