RLOO Checkpoint Issue

asparius commented 5 days ago

System Info

Platform: Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-glibc2.17
Python version: 3.9.5
PyTorch version: 2.4.0
CUDA device(s): NVIDIA A100-SXM4-40GB
Transformers version: 4.46.2
Accelerate version: 1.1.1
Accelerate config: not found
Datasets version: 3.1.0
HF Hub version: 0.26.2
TRL version: 0.12.0
bitsandbytes version: not installed
DeepSpeed version: not installed
Diffusers version: not installed
Liger-Kernel version: not installed
LLM-Blender version: not installed
OpenAI version: 1.54.1
PEFT version: not installed

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder
[ ] My own task or dataset (give details below)

Reproduction

python rloo_tldr.py --output_dir models/minimal/rloo_tldr --dataset_name trl-internal-testing/tldr-preference-sft-trl-style --dataset_test_split validation --num_ppo_epochs 2 --num_mini_batches 2 --learning_rate 3e-6 --per_device_train_batch_size 4 --gradient_accumulation_steps 16 --total_episodes 1000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr --local_rollout_forward_batch_size 16 --missing_eos_penalty 1.0 --stop_token eos --kl_coef 0.03 --save_strategy steps --save_steps 10000 --eval_strategy steps --eval_steps 1000 --report_to wandb

Expected behavior

The script above should not save any checkpoints, since the number of training episodes (--total_episodes 1000) is far too low to ever reach --save_steps 10000. It still produces a checkpoint every 5 steps, similar to the previous RLOO issue #2124.
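To make the expectation concrete, here is a rough sketch of the arithmetic. It assumes a single process and that one update consumes per_device_train_batch_size × gradient_accumulation_steps × num_mini_batches episodes; that formula and the helper name rollout_schedule are assumptions for illustration, not something confirmed in this thread.

```python
import math

def rollout_schedule(total_episodes, per_device_train_batch_size,
                     gradient_accumulation_steps, num_mini_batches,
                     world_size=1):
    """Return (episodes per update, number of updates).

    Assumption: episodes per update is the product of the three batch
    settings and the number of processes (hypothetical helper).
    """
    batch_size = (per_device_train_batch_size * gradient_accumulation_steps
                  * num_mini_batches * world_size)
    num_updates = math.ceil(total_episodes / batch_size)
    return batch_size, num_updates

# Values taken from the reproduction command above.
batch_size, num_updates = rollout_schedule(1000, 4, 16, 2)
save_steps = 10000

# Under this assumption the run ends long before the first save step.
print(batch_size, num_updates, num_updates >= save_steps)  # 128 8 False
```

Under these assumptions the run finishes after 8 updates, so a save interval of 10000 steps should never trigger.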

qgallouedec commented 4 days ago

It should be fixed by #2325. Could you confirm?

asparius commented 4 days ago

The saving issue is solved, but training time has increased significantly: 1 million episodes now take 300+ hours on an A100. Is this expected? Is there a reference number to compare with?
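For comparison purposes, the reported figure can be turned into a throughput estimate. This is only a back-of-the-envelope sketch: it treats 300 hours as a lower bound and assumes the 1M-episode run used the same batch settings as the reproduction command (128 episodes per update on one GPU), which is not stated here.

```python
import math

total_episodes = 1_000_000
hours = 300                   # reported as "300+", so a lower bound
episodes_per_update = 4 * 16 * 2  # assumption: same batch settings as above

total_seconds = hours * 3600
episodes_per_second = total_episodes / total_seconds
num_updates = math.ceil(total_episodes / episodes_per_update)
seconds_per_update = total_seconds / num_updates

print(f"{episodes_per_second:.2f} episodes/s")        # 0.93 episodes/s
print(f"{num_updates} updates")                       # 7813 updates
print(f"{seconds_per_update:.0f} s per update")       # 138 s per update
```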

qgallouedec commented 1 day ago

I can't reproduce:

# TRL v0.12.1 (includes the fix); transformers v4.47 dev (blue)
/fsx/qgallouedec/trl/examples/scripts/rloo/rloo_tldr.py --output_dir models/minimal/rloo_tldr --dataset_name trl-internal-testing/tldr-preference-sft-trl-style --dataset_test_split validation --num_ppo_epochs 2 --num_mini_batches 2 --learning_rate 3e-6 --per_device_train_batch_size 4 --gradient_accumulation_steps 16 --total_episodes 1000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr --local_rollout_forward_batch_size 16 --missing_eos_penalty 1.0 --stop_token eos --kl_coef 0.03 --save_strategy steps --save_steps 10000 --eval_strategy steps --eval_steps 1000 --report_to wandb

# TRL v0.11 (doesn't include the fix); transformers v4.45 (red)
/fsx/qgallouedec/trl/examples/scripts/rloo/rloo_tldr.py --output_dir models/minimal/rloo_tldr --num_ppo_epochs 2 --num_mini_batches 2 --learning_rate 3e-6 --per_device_train_batch_size 4 --gradient_accumulation_steps 16 --total_episodes 1000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr --local_rollout_forward_batch_size 16 --missing_eos_penalty 1.0 --stop_token eos --kl_coef 0.03 --save_strategy steps --save_steps 10000 --eval_strategy steps --eval_steps 1000 --report_to wandb

[Image: W&B chart, 14/11/2024, 12:08:20]

huggingface / trl

RLOO Checkpoint Issue #2342