Closed · abhinand5 closed this issue 2 days ago
Packages:
llamafactory 0.8.1.dev0 Transformers 4.41.2 Pytorch 2.2.0+cu121 Datasets 2.19.2 Tokenizers 0.19.1
System:
1x RTX 4090
Environment:
Custom Runpod Image for LLaMA factory
https://github.com/abhinand5/runpod-utils/blob/main/docker/llama-factory/Dockerfile
Config (`config1.yaml`):

```yaml
### model
model_name_or_path: abhinand/Llama3-mini-init-fp16 # private model
model_revision: main

### method
stage: pt
do_train: true
finetuning_type: full
# lora_target: all

### dataset
dataset: fineweb_edu_10b
cutoff_len: 2048
# --- Used for debugging ---
# max_samples: 1000
# --- Longer Debug ---
# max_samples: 100000
# --------------------------
overwrite_cache: false
preprocessing_num_workers: 12

### output
output_dir: saves/llama3-mini-v0/pt
logging_steps: 1
save_steps: 50
plot_loss: true
overwrite_output_dir: true
save_total_limit: 3

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 8
learning_rate: 6.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
optim: adamw_torch
# warmup_steps: 500
warmup_ratio: 0.05
bf16: true
# fp16: false
# resize_vocab: true
train_from_scratch: true
# use_unsloth: false
flash_attn: fa2
# packing: false
max_grad_norm: 1.0

### eval
val_size: 0.01
per_device_eval_batch_size: 8
eval_accumulation_steps: 8
eval_strategy: steps
eval_steps: 50
bf16_full_eval: true

### general
push_to_hub: true
hub_model_id: abhinand/Llama3-mini-init-pt-internal-v0-test
hub_private_repo: true
include_tokens_per_second: true
include_num_input_tokens_seen: true
```
Command:
$ llamafactory-cli train ../configs/config1.yaml 2>&1 | tee ../logs/run0.log
Expected: training starts.
Instead, it is stuck here:
```
...
06/15/2024 13:10:33 - WARNING - llamafactory.model.model_utils.attention - FlashAttention-2 is not installed.
[INFO|configuration_utils.py:962] 2024-06-15 13:10:33,231 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}
06/15/2024 13:10:38 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/15/2024 13:10:38 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
06/15/2024 13:10:38 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/15/2024 13:10:38 - INFO - llamafactory.model.adapter - Fine-tuning method: Full
06/15/2024 13:10:38 - INFO - llamafactory.model.loader - trainable params: 785434624 || all params: 785434624 || trainable%: 100.0000
[INFO|trainer.py:641] 2024-06-15 13:10:39,680 >> Using auto half precision backend
```
It does work when `max_samples` is set to a small number like 1000, though. Should I just wait longer? I already waited 90 minutes before killing the process.
How large is your dataset?
Quite large. I found the issue — more details here: https://github.com/huggingface/transformers/issues/31501