hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

BF16 Pretraining Not Starting On Large Datasets #4305

Closed · abhinand5 closed this issue 2 days ago

abhinand5 commented 1 week ago

System Info

Packages:

- llamafactory 0.8.1.dev0
- Transformers 4.41.2
- PyTorch 2.2.0+cu121
- Datasets 2.19.2
- Tokenizers 0.19.1

System:

1x RTX 4090

Environment:

Custom RunPod image for LLaMA-Factory:

https://github.com/abhinand5/runpod-utils/blob/main/docker/llama-factory/Dockerfile

Reproduction

### model
model_name_or_path: abhinand/Llama3-mini-init-fp16 # private model
model_revision: main

### method
stage: pt
do_train: true
finetuning_type: full
# lora_target: all

### dataset
dataset: fineweb_edu_10b
cutoff_len: 2048
# --- Used for debugging ---
# max_samples: 1000
# --- Longer Debug ---
# max_samples: 100000
# --------------------------
overwrite_cache: false
preprocessing_num_workers: 12

### output
output_dir: saves/llama3-mini-v0/pt
logging_steps: 1
save_steps: 50
plot_loss: true
overwrite_output_dir: true
save_total_limit: 3

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 8
learning_rate: 6.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
optim: adamw_torch
# warmup_steps: 500
warmup_ratio: 0.05
bf16: true
# fp16: false
# resize_vocab: true
train_from_scratch: true
# use_unsloth: false
flash_attn: fa2
# packing: false
max_grad_norm: 1.0

### eval
val_size: 0.01
per_device_eval_batch_size: 8
eval_accumulation_steps: 8
eval_strategy: steps
eval_steps: 50
bf16_full_eval: true

### general
push_to_hub: true
hub_model_id: abhinand/Llama3-mini-init-pt-internal-v0-test
hub_private_repo: true
include_tokens_per_second: true
include_num_input_tokens_seen: true

Command:

$ llamafactory-cli train ../configs/config1.yaml 2>&1 | tee ../logs/run0.log
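As a sanity check for runs like this, it can help to time the tokenizer on a small slice of the corpus and extrapolate before assuming the job is hung. The sketch below is not part of LLaMA-Factory; the data-files path and TOTAL_DOCS value are placeholders for whatever fineweb_edu_10b points to in dataset_info.json, and the tokenizer is loaded from the (private) model named in the config above.

import time
from datasets import load_dataset
from transformers import AutoTokenizer

SAMPLE_SIZE = 2_000        # documents to time
TOTAL_DOCS = 10_000_000    # assumed corpus size; replace with the real row count

# Tokenizer of the model from the config above (private); any tokenizer with the same vocab works.
tokenizer = AutoTokenizer.from_pretrained("abhinand/Llama3-mini-init-fp16")

# Placeholder path: point this at the files that fineweb_edu_10b maps to in dataset_info.json.
sample = load_dataset("json", data_files="data/fineweb_edu_10b/*.jsonl",
                      split=f"train[:{SAMPLE_SIZE}]")

start = time.perf_counter()
sample.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    num_proc=12,           # same as preprocessing_num_workers in the config
)
elapsed = time.perf_counter() - start

docs_per_sec = SAMPLE_SIZE / elapsed
print(f"~{docs_per_sec:.0f} docs/s -> full corpus roughly "
      f"{TOTAL_DOCS / docs_per_sec / 3600:.1f} h to tokenize")

If the extrapolated time comes out in hours rather than minutes, a long silent phase before the first training step may simply be a slow pass over the full corpus rather than a hang.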

Expected behavior

Training should start.

Instead, the run is stuck here:

...

06/15/2024 13:10:33 - WARNING - llamafactory.model.model_utils.attention - FlashAttention-2 is not installed.
[INFO|configuration_utils.py:962] 2024-06-15 13:10:33,231 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

06/15/2024 13:10:38 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/15/2024 13:10:38 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
06/15/2024 13:10:38 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/15/2024 13:10:38 - INFO - llamafactory.model.adapter - Fine-tuning method: Full
06/15/2024 13:10:38 - INFO - llamafactory.model.loader - trainable params: 785434624 || all params: 785434624 || trainable%: 100.0000
[INFO|trainer.py:641] 2024-06-15 13:10:39,680 >> Using auto half precision backend

Training does start when max_samples is set to a small number like 1000, though. Should I just wait longer? I already waited 90 minutes and then killed the process.
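One way to answer "should I wait longer?" without guessing is to get a Python stack dump from the live process. Below is a minimal, hypothetical sketch, assuming you can add a couple of lines to whatever script launches training; this is not built into LLaMA-Factory.

# Hypothetical diagnostic: add these lines near the top of the script that
# launches training. While the run appears stuck, `kill -USR1 <pid>` prints
# every thread's Python stack to stderr, which shows whether the trainer is
# still iterating over the dataset or is genuinely deadlocked.
import faulthandler
import signal

# Dump all thread stacks on demand via SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Safety net: also dump stacks automatically every hour until training starts.
faulthandler.dump_traceback_later(timeout=3600, repeat=True)

An alternative that requires no code changes is py-spy dump --pid <pid>, which prints the same kind of stack trace from outside the process.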


SandroChen commented 1 week ago

How large is your dataset?

abhinand5 commented 4 days ago

Quite large. I found the issue; more details here: https://github.com/huggingface/transformers/issues/31501