GitIgnoreMaybe opened this issue 3 days ago
The reproduction command has not been posted, so we don't know what process you are running.
Hey @codemayq,
Thanks for the help. Here is the full command I'm running:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path microsoft/Phi-3-small-8k-instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--quantization_bit 4 \
--quantization_method bitsandbytes \
--template phi \
--flash_attn fa2 \
--dataset_dir data \
--dataset custom_instruct_training_data.json \
--cutoff_len 1024 \
--learning_rate 1.0e-04 \
--num_train_epochs 1.0 \
--max_samples 1000 \
--per_device_train_batch_size 5 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir saves/Phi3-7B-8k-Chat/lora/train_2024-07-05-13-47-27 \
--bf16 True \
--plot_loss True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--lora_rank 256 \
--lora_alpha 512 \
--lora_dropout 0 \
--lora_target all \
--val_size 0.1 \
--eval_strategy steps \
--eval_steps 100 \
--per_device_eval_batch_size 5
~~I think my LoRA rank and LoRA alpha were wrong.~~
Decrease the train batch size.
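For example (illustrative values), replacing these two flags in the command above keeps the effective batch size at 5 × 8 = 40 while cutting per-device memory:
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 40 \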
@hiyouga Thanks for the help.
This didn't work either, but I figured out that the quantization is what creates the issue: it works when I'm not quantizing. Sounds like a bug, right?
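If it really is the 4-bit path, a quick first check (generic, not specific to LLaMA Factory) is whether the bitsandbytes install itself is healthy on CUDA 11.8; recent bitsandbytes releases bundle a self-diagnostic:
python -c "import bitsandbytes; print(bitsandbytes.__version__)"
python -m bitsandbytes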
Failing with this:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path microsoft/Phi-3-small-8k-instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--quantization_bit 4 \
--quantization_method bitsandbytes \
--template phi \
--flash_attn fa2 \
--dataset_dir data \
--dataset data_query_expansion.json \
--cutoff_len 512 \
--learning_rate 0.0001 \
--num_train_epochs 8.0 \
--max_samples 1000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir saves/Phi3-7B-8k-Chat/lora/output-q4 \
--bf16 True \
--plot_loss True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--val_size 0.1 \
--eval_strategy steps \
--eval_steps 100 \
--per_device_eval_batch_size 1
This worked:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path microsoft/Phi-3-small-8k-instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--quantization_method bitsandbytes \
--template phi \
--flash_attn fa2 \
--dataset_dir data \
--dataset data_query_expansion.json \
--cutoff_len 512 \
--learning_rate 0.0001 \
--num_train_epochs 8.0 \
--max_samples 1000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir saves/Phi3-7B-8k-Chat/lora/output-q4 \
--bf16 True \
--plot_loss True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--val_size 0.1 \
--eval_strategy steps \
--eval_steps 100 \
--per_device_eval_batch_size 1
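The only difference between the failing and the working run above is the --quantization_bit 4 flag. To narrow it down further, one option (a suggestion, untested here) is 8-bit bitsandbytes quantization, which goes through a different code path; a minimal sketch reusing the flags from the failing run, with an illustrative output directory:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path microsoft/Phi-3-small-8k-instruct \
--finetuning_type lora \
--quantization_bit 8 \
--quantization_method bitsandbytes \
--template phi \
--flash_attn fa2 \
--dataset_dir data \
--dataset data_query_expansion.json \
--cutoff_len 512 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--lora_rank 8 \
--lora_alpha 16 \
--lora_target all \
--output_dir saves/Phi3-7B-8k-Chat/lora/output-q8 \
--bf16 True
If the 8-bit run also fails, the problem is more likely in the bitsandbytes/CUDA setup than in the 4-bit LoRA path specifically.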
Reminder
System Info
llamafactory-0.8.3.dev0, Ubuntu 22.04.3 LTS, py3.10, cuda11.8.0
Reproduction
Command:
Error
Expected behavior
Hello, I'm really not sure whether this is a LLaMA Factory issue or a problem with the cloud GPU provider. Does anyone know what to do?
Others
No response