huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

FineWeb SLM Training doesn't start #31501

Open abhinand5 opened 2 months ago

abhinand5 commented 2 months ago

System Info

This is my dev env: https://github.com/abhinand5/runpod-utils/blob/main/docker/torch-lm-dev/Dockerfile

Using the latest Docker image.

Torch 2.3.1 CUDA 12.1

Who can help?

@ArthurZucker @younesbelkada

Reproduction

I am using a custom version of run_clm.py, where the only changes are:

  1. Accepted YAML input along with JSON for parsing arguments
  2. Ability to change attention implementation with an argument
  3. Handling validation_split_percentage less than 1% for huge datasets (because 1% is actually a lot).
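
For illustration, change 3 could look roughly like the sketch below. This is not the code from the custom script: it assumes the `datasets` split-slicing syntax only takes whole-number percentages (worth verifying against the installed version), so a fractional percentage is converted into absolute row indices. The helper name `fractional_split_specs` is hypothetical.

```python
def fractional_split_specs(num_rows: int, validation_pct: float, split: str = "train"):
    """Return (validation_spec, train_spec) split strings using absolute row indices.

    Hypothetical helper: turns a fractional percentage such as 0.5 into
    slices like "train[:5000]" and "train[5000:]".
    """
    if not 0.0 < validation_pct < 100.0:
        raise ValueError("validation_pct must be strictly between 0 and 100")
    # Round to the nearest row, but always keep at least one validation row.
    n_val = max(1, round(num_rows * validation_pct / 100.0))
    if n_val >= num_rows:
        raise ValueError("dataset too small for the requested validation split")
    return f"{split}[:{n_val}]", f"{split}[{n_val}:]"


# Example: 0.5% of a one-million-row split.
val_spec, train_spec = fractional_split_specs(1_000_000, 0.5)
print(val_spec, train_spec)
```

The two strings can then be passed as the `split` argument when loading the validation and training portions separately.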

After the dataset has been preprocessed (the tokenizer has run and the results are cached), training takes a very long time to start. I waited one hour on four separate occasions and killed the instance each time because, as you know, GPUs aren't cheap.

It works when I reduce max_train_samples to something small, like 10000. See DEBUGGING in the YAML below.

Here is my config:

# --- Model settings ---
model_name_or_path: abhinand/personal-model-init-fp16
model_revision: main

# --- Training settings ---
do_train: true
seed: 1337

# --- Cache and torch settings ---
cache_dir: /workspace/.cache
trust_remote_code: true
torch_dtype: bfloat16
attn_implementation: "flash_attention_2"

# --- Dataset settings ---
dataset_name: HuggingFaceFW/fineweb-edu
dataset_config_name: sample-10BT

# --- DEBUGGING ---
# max_train_samples: 1000
# max_steps: 60
# max_eval_samples: 1000
# -----------------

# --- Data processing settings ---
block_size: 2048
overwrite_cache: false
validation_split_percentage: 0.5

# --- Preprocessing settings ---
preprocessing_num_workers: 16
output_dir: ./outputs
overwrite_output_dir: true

# --- Evaluation settings ---
eval_strategy: steps
eval_steps: 50
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
gradient_accumulation_steps: 16
eval_accumulation_steps: 16

# --- Optimizer settings ---
optim: adamw_torch_fused
learning_rate: 3.0e-4
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 1.0e-8
max_grad_norm: 1.0
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
# warmup_steps: 500
weight_decay: 0.01

# --- Logging settings ---
logging_strategy: steps
logging_steps: 1
save_strategy: steps
save_steps: 50
save_total_limit: 5
save_safetensors: true
bf16: true
fp16: false
# bf16_full_eval: true

# --- Torch settings ---
torch_compile: false # not sure why it doesn't work with flash_attention_2
include_tokens_per_second: true
include_num_input_tokens_seen: true

# --- Hub settings ---
push_to_hub: true
hub_model_id: abhinand/personal-model-v0-test1
# hub_strategy: 
hub_private_repo: true
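
As a rough sanity check on the batch settings in this config, the number of tokens consumed per optimizer step can be worked out as below. The device count is not stated in the report, so `num_devices=1` is an assumption, and the function name is made up for this sketch.

```python
def tokens_per_optimizer_step(per_device_batch: int, grad_accum: int,
                              block_size: int, num_devices: int = 1) -> int:
    """Tokens processed between two optimizer updates."""
    return per_device_batch * grad_accum * block_size * num_devices


# Values taken from the config: per_device_train_batch_size=8,
# gradient_accumulation_steps=16, block_size=2048.
# num_devices=1 is an assumption, not something the report states.
tokens = tokens_per_optimizer_step(8, 16, 2048)
print(tokens)  # 262144
```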

Expected behavior

Training to start...

abhinand5 commented 2 months ago

I found what's causing the initial delay.

Can we just approximate tokens per second at every batch instead? You're already tracking num_input_tokens_seen.

So wouldn't something like this work, if I only want to see TGS for every step?

For every step:
    num_input_tokens_seen / (end_time_of_step - start_time_of_step)
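
A minimal sketch of that idea follows. It assumes `num_input_tokens_seen` is a running total across steps, so the per-step figure uses the difference between consecutive readings rather than the raw counter. The `StepThroughput` class is hypothetical and not part of the Trainer API.

```python
import time


class StepThroughput:
    """Tracks tokens per second for each step from a cumulative token counter."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._last_time = clock()
        self._last_tokens = 0

    def update(self, num_input_tokens_seen: int) -> float:
        """Call once at the end of each step with the cumulative token count."""
        now = self._clock()
        elapsed = now - self._last_time
        delta_tokens = num_input_tokens_seen - self._last_tokens
        self._last_time = now
        self._last_tokens = num_input_tokens_seen
        # Guard against a zero-length interval.
        return delta_tokens / elapsed if elapsed > 0 else 0.0


# Demo with a fake clock so the numbers are deterministic:
# each step takes 2 seconds and processes 16,384 tokens.
fake_times = iter([0.0, 2.0, 4.0])
meter = StepThroughput(clock=lambda: next(fake_times))
first = meter.update(16_384)
second = meter.update(32_768)
print(first, second)
```

Because nothing is computed up front, this adds no start-up cost; the trade-off is that it reports per-step throughput rather than an average over the whole run.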

amyeroberts commented 2 months ago

cc @muellerzr @SunMarc

SunMarc commented 2 months ago

Hi @abhinand5, thanks for the report! This could indeed make sense, since include_tokens_per_second needs to iterate over the entire dataset. WDYT @muellerzr, for when you are back?
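
To illustrate why a full pass is expensive, and one possible way around it, here is a small sketch. It is not the Trainer's actual implementation: it contrasts counting tokens by visiting every batch with a closed-form estimate that is only valid under the assumption that every example is a fixed-length block of `block_size` tokens (as with the grouped blocks in this setup). Both function names are hypothetical.

```python
def count_tokens_by_iteration(batches) -> int:
    """Sum token counts by visiting every batch; cost grows with dataset size."""
    return sum(len(sequence) for batch in batches for sequence in batch)


def estimate_tokens_fixed_blocks(num_examples: int, block_size: int) -> int:
    """Constant-time estimate, assuming every example has exactly block_size tokens."""
    return num_examples * block_size


# Toy data: 4 batches of 8 sequences, each 2048 token ids long.
batches = [[[0] * 2048 for _ in range(8)] for _ in range(4)]
counted = count_tokens_by_iteration(batches)
estimated = estimate_tokens_fixed_blocks(num_examples=4 * 8, block_size=2048)
print(counted, estimated)
```

On a corpus with millions of blocks the first function has to load every batch before training can begin, while the second needs only the example count.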

ArthurZucker commented 1 month ago

Also @abhinand5, feel free to open a PR, as I believe @muellerzr has not had the time to pick this up yet!

abhinand5 commented 3 weeks ago

Hi @ArthurZucker, sure, I'll take this up!