Open Ben-Schneider-code opened 4 weeks ago
Thanks for reporting, help in proposing a fix would be greatly appreciated.
Reproduction
This issue was initially reported in the hf transformers repo here: https://github.com/huggingface/transformers/issues/29348
I can probably put together a fix for trl when I have some more free time if y'all are interested, since I understand the behaviour now.
Current Behaviour
The base Hugging Face transformers Trainer calls `hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)` to change the values of `total_num_steps` and `warmup_num_steps` from "auto" to their calculated values during the inner training loop (once the total number of steps is known). However, in DPOTrainer, if `total_num_steps` is set to "auto", the trainer crashes when `deepspeed.initialize` is called while wrapping the ref model at `self.ref_model = self._prepare_deepspeed(self.ref_model)`.
DS config
Script
Crash log
[2024-10-02 01:18:15,497] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 121.5 GB, percent = 12.1%
[2024-10-02 01:18:15,497] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
rank0: Traceback (most recent call last):
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3489, in <module>
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
rank0: globals = debugger.run(setup['file'], None, None, is_module)
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2510, in run
rank0: return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
rank0: globals = pydevd_runpy.run_path(file, globals, '__main__')
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
rank0: return _run_module_code(code, init_globals, run_name,
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
rank0: _run_code(code, mod_globals, init_globals,
rank0: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
rank0: exec(code, run_globals)
rank0: File "/home/b3schnei/transformers_debug/debug/29348/reproduce.py", line 34, in <module>
rank0: dpo_trainer = DPOTrainer(
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
rank0: return f(*args, **kwargs)
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 883, in __init__
rank0: self.ref_model = self._prepare_deepspeed(self.ref_model)
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 924, in _prepare_deepspeed
rank0: model, _ = deepspeed.initialize(model=model, config=config_kwargs)
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
rank0: engine = DeepSpeedEngine(args=args,
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 907, in _configure_lr_scheduler
rank0: lr_scheduler = self._scheduler_from_config(self.optimizer)
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 962, in _scheduler_from_config
rank0: instantiated_scheduler = scheduler(optimizer, **scheduler_params)
rank0: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/lr_schedules.py", line 758, in __init__
rank0: if self.total_num_steps < self.warmup_num_steps:
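The last frame in the log is the comparison `self.total_num_steps < self.warmup_num_steps`. The log above does not include the final exception line, so the following is only a minimal sketch of the likely failure mode, assuming `total_num_steps` is still the literal string "auto" while `warmup_num_steps` is a number. The variable values here are hypothetical stand-ins, not taken from the actual DS config.

```python
# Hypothetical stand-ins for the scheduler params; the real values come from the DS config.
total_num_steps = "auto"  # assumed: never replaced with a calculated value
warmup_num_steps = 100    # assumed numeric value, for illustration only

error = None
try:
    # Same shape as the comparison in the last traceback frame.
    total_num_steps < warmup_num_steps
except TypeError as exc:
    # Python 3 refuses to order a str against an int.
    error = exc
    print(type(exc).__name__, exc)
```

If both values were still the string "auto", this particular comparison would not raise, so the crash suggests only some of the values were resolved before `deepspeed.initialize` ran.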
Expected behavior
I expect DPOTrainer to initialize under ZeRO-3 when ds_config values are set to "auto", as the transformers Trainer does.
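One possible direction, sketched here under assumptions and not the actual TRL or transformers implementation, is to fill in the "auto" scheduler step counts on a copy of the config before it reaches `deepspeed.initialize` for the ref model. The helper name `resolve_auto_scheduler_params`, the example config, and the step counts are all hypothetical; only the `total_num_steps` and `warmup_num_steps` keys come from the report above.

```python
import copy


def resolve_auto_scheduler_params(ds_config, num_training_steps, num_warmup_steps):
    """Return a copy of ds_config with "auto" scheduler step counts filled in.

    Hypothetical helper for illustration; it only touches scheduler.params.
    """
    config = copy.deepcopy(ds_config)  # leave the caller's config untouched
    params = config.get("scheduler", {}).get("params", {})
    if params.get("total_num_steps") == "auto":
        params["total_num_steps"] = num_training_steps
    if params.get("warmup_num_steps") == "auto":
        params["warmup_num_steps"] = num_warmup_steps
    return config


# Illustrative config fragment (assumed, not the config from this report).
example_config = {
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {"total_num_steps": "auto", "warmup_num_steps": "auto"},
    }
}

resolved = resolve_auto_scheduler_params(
    example_config, num_training_steps=1000, num_warmup_steps=100
)
print(resolved["scheduler"]["params"])
```

Working on a deep copy keeps the original "auto" values available for the main model, which the transformers Trainer resolves later in the training loop.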