huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0
10.21k stars, 1.3k forks

Error when loading DDPOTrainer using DeepSpeed #2388

Closed coder109 closed 3 days ago

coder109 commented 6 days ago

Reproduction

How to reproduce?

A short code snippet:

from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments
from typing import Optional
from transformers import MODEL_FOR_CAUSAL_LM_MAPPING

MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(
        default=None,
    )
    model_type: Optional[str] = field(
        default=None,
    )
    config_overrides: Optional[str] = field(
        default=None,
    )
    config_name: Optional[str] = field(
        default=None,
    )
    tokenizer_name: Optional[str] = field(
        default=None,
    )
    cache_dir: Optional[str] = field(
        default=None,
    )
    use_fast_tokenizer: bool = field(
        default=True,
    )
    model_revision: str = field(
        default="main",
    )
    use_auth_token: bool = field(
        default=False,
    )
    torch_dtype: Optional[str] = field(
        default=None,
    )

@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = field(
        default=None,
    )
    dataset_config_name: Optional[str] = field(
        default=None,
    )
    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
    validation_file: Optional[str] = field(
        default=None,
    )
    max_train_samples: Optional[int] = field(
        default=None,
    )
    max_eval_samples: Optional[int] = field(
        default=None,
    )
    streaming: bool = field(default=False)
    block_size: Optional[int] = field(
        default=None,
    )
    overwrite_cache: bool = field(
        default=False
    )
    validation_split_percentage: Optional[int] = field(
        default=5,
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
    )
    keep_linebreaks: bool = field(
        default=True
    )

@dataclass
class ScriptArguments:
    pretrained_model: str = field(
        default="runwayml/stable-diffusion-v1-5", metadata={"help": "the pretrained model to use"}
    )
    pretrained_revision: str = field(default="main", metadata={"help": "the pretrained model revision to use"})
    hf_hub_model_id: Optional[str] = field(
        default=None, metadata={"help": "HuggingFace repo to save model weights to"}
    )
    hf_hub_aesthetic_model_id: Optional[str] = field(
        default=None,
        metadata={"help": "HuggingFace model ID for aesthetic scorer model weights"},
    )
    hf_hub_aesthetic_model_filename: Optional[str] = field(
        default=None,
        metadata={"help": "HuggingFace model filename for aesthetic scorer model weights"},
    )
    use_lora: bool = field(default=True, metadata={"help": "Whether to use LoRA."})

def reward_fn():
    def _fn():
        return 1

    return _fn

def prompt_fn():
    def _fn():
        return

    return _fn

def load_sd_related_model(reward_fn, prompt_fn):
    script_cfg = ScriptArguments(
        pretrained_model="/home/export/base/ycsc_chenkh/chenkh/online1/andongchen/vision_MT/mmt_image/RAG-MT/models/stable-diffusion-2-1-base",
    )
    train_cfg = DDPOConfig(
        num_epochs = 3,
        logdir = "./blah",
        train_batch_size = 3,
        sample_batch_size = 3,
        train_learning_rate = 1e-5,
    )
    train_cfg.project_kwargs = {
        "logging_dir": "./blah",
        "automatic_checkpoint_naming": True,
        "total_limit": 5,
        "project_dir": "./blahblah"
    }

    pipe = DefaultDDPOStableDiffusionPipeline(
        script_cfg.pretrained_model,
        pretrained_model_revision=script_cfg.pretrained_revision,
        use_lora=script_cfg.use_lora,
    )
    pipe.set_progress_bar_config(disable=True)

    trainer = DDPOTrainer(
        train_cfg,
        prompt_function=prompt_fn(),
        reward_function=reward_fn(),  # Use the reward function defined above
        sd_pipeline=pipe,
    )
    return pipe, trainer

if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    # Parsing the arguments (including --deepspeed) before building the trainer triggers the error below.
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    pipe, trainer = load_sd_related_model(reward_fn, prompt_fn)

This Python file can be launched on Linux with the following script:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=enp83s0f1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_SL=3
export NCCL_NET_GDR_READ=1
export DS_SKIP_CUDA_CHECK=1

export MASTER_ADDR="${CHIEF_IP:=localhost}"
export MASTER_PORT="${MASTER_PORT:=11451}"
export HOST_NUM=1
export INDEX=0

# HOST_NUM will be 1
torchrun --nnodes $HOST_NUM --node_rank $INDEX --nproc_per_node 4 --master_addr $MASTER_ADDR --master_port $MASTER_PORT  \
    issue.py \
    --model_name_or_path "./blah" \
    --deepspeed ./deepspeed_config_zero2.json \
    --train_file 30k_en_fr.json \
    --output_dir "./blah" 

Output:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/online1/ycsc_chenkh/chenkh/andongchen/vision_MT/mmt_image/RAG-MT/ParroT/transformers/examples/pytorch/language-modeling/issue.py", line 147, in <module>
[rank1]:     pipe, trainer = load_sd_related_model(reward_fn, prompt_fn)
[rank1]:   File "/online1/ycsc_chenkh/chenkh/andongchen/vision_MT/mmt_image/RAG-MT/ParroT/transformers/examples/pytorch/language-modeling/issue.py", line 135, in load_sd_related_model
[rank1]:     trainer = DDPOTrainer(
[rank1]:   File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/trl/trainer/ddpo_trainer.py", line 184, in __init__
[rank1]:     unet, self.optimizer = self.accelerator.prepare(trainable_layers, self.optimizer)
[rank1]:   File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank1]:     result = self._prepare_deepspeed(*args)
[rank1]:   File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/accelerate/accelerator.py", line 1654, in _prepare_deepspeed
[rank1]:     raise ValueError(
[rank1]: ValueError: When using DeepSpeed, `accelerate.prepare()` requires you to pass at least one of training or evaluation dataloaders with `batch_size` attribute returning an integer value or alternatively set an integer value in `train_micro_batch_size_per_gpu` in the deepspeed config file or assign integer value to `AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu']`.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
E1124 16:30:03.665000 140389013079872 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 26925) of binary: /home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/bin/python
Traceback (most recent call last):
  File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/export/base/ycsc_chenkh/chenkh/online1/support/anaconda3/envs/DreamLLM_Parrot/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-24_16:30:03
  host      : gpu004
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 26926)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-11-24_16:30:03
  host      : gpu004
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 26927)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-11-24_16:30:03
  host      : gpu004
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 26929)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-24_16:30:03
  host      : gpu004
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 26925)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Other Information

I tried commenting out the line model_args, data_args, training_args = parser.parse_args_into_dataclasses(), and the error no longer occurs.

In the DeepSpeed config file, I have set:

    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": 4,

I think there might be some parameters that conflict with each other, which leads to the problem.
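
To rule out the config file itself, the two batch-size keys can be checked for values that are still unresolved. The following is only a sketch: find_unresolved_batch_keys and load_ds_config are hypothetical helpers, not part of trl, accelerate, or DeepSpeed, and the file is assumed to be plain JSON like the fragment above.

```python
import json
from pathlib import Path

# The two keys set in the config fragment above.
BATCH_KEYS = ("train_batch_size", "train_micro_batch_size_per_gpu")


def is_plain_int(value) -> bool:
    """True for real integers; False for "auto", None, and booleans."""
    return isinstance(value, int) and not isinstance(value, bool)


def find_unresolved_batch_keys(ds_config: dict) -> list:
    """Return the batch-size keys that are missing or not plain integers."""
    return [key for key in BATCH_KEYS if not is_plain_int(ds_config.get(key))]


def load_ds_config(path: str) -> dict:
    """Read a DeepSpeed JSON config file into a dict."""
    return json.loads(Path(path).read_text())


# Same values as the fragment above.
example = {"train_batch_size": "auto", "train_micro_batch_size_per_gpu": 4}
print(find_unresolved_batch_keys(example))  # ['train_batch_size']
```

If find_unresolved_batch_keys(load_ds_config("./deepspeed_config_zero2.json")) does not list train_micro_batch_size_per_gpu, the file itself is probably fine, and the integer is more likely lost somewhere between the file and accelerator.prepare().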

I am not sure whether this is a trl-related problem.

Hope someone can help me.

Expected behavior

No error occurs and everything works fine.

coder109 commented 3 days ago

It works if you parse the arguments after constructing DDPOTrainer(). But I'd like to know what causes these two seemingly unrelated calls to affect each other.
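
One way to look for the link between the two calls is to compare environment variables before and after argument parsing. This is a sketch under an assumption that is not confirmed in this thread: that parsing TrainingArguments with --deepspeed sets ACCELERATE_USE_DEEPSPEED, which a later Accelerator() might pick up on its own. snapshot_env and diff_env are hypothetical helper names.

```python
import os


def snapshot_env(prefix: str = "ACCELERATE_") -> dict:
    """Copy the environment variables whose names start with `prefix`."""
    return {k: v for k, v in os.environ.items() if k.startswith(prefix)}


def diff_env(before: dict, after: dict) -> dict:
    """Return the variables that were added or changed between two snapshots."""
    return {k: v for k, v in after.items() if before.get(k) != v}


# Intended use in issue.py (parser is the HfArgumentParser from the script):
#   before = snapshot_env()
#   model_args, data_args, training_args = parser.parse_args_into_dataclasses()
#   print(diff_env(before, snapshot_env()))
```

A non-empty result would show that argument parsing changes process-wide state that DDPOTrainer() later reads, which would explain why the order of the two calls matters.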

coder109 commented 3 days ago

I PROBABLY know the cause. If you have encountered the same problem, modify the source code of DDPOTrainer() like this:

        self.accelerator = Accelerator(
            log_with=self.config.log_with,
            mixed_precision=self.config.mixed_precision,
            project_config=accelerator_project_config,
            # we always accumulate gradients across timesteps; we want config.train.gradient_accumulation_steps to be the
            # number of *samples* we accumulate across, so we need to multiply by the number of training timesteps to get
            # the total number of optimizer steps to accumulate across.
            gradient_accumulation_steps=self.config.train_gradient_accumulation_steps * self.num_train_timesteps,
            **self.config.accelerator_kwargs,
        )

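        # Accelerate MOD BEGIN
        # Force an integer micro batch size so that accelerator.prepare() does not raise the ValueError above.
        # The value 4 matches train_micro_batch_size_per_gpu in my DeepSpeed config file.
        if self.accelerator.state.deepspeed_plugin is not None:
            self.accelerator.state.deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'] = 4
        # Accelerate MOD END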

It seems that DDPOTrainer() cannot properly load the DeepSpeed config from an external JSON file.
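
The hard-coded 4 in the modification above could instead be read from the same JSON file that is passed to --deepspeed. A minimal sketch, assuming the live config is the dict reached through self.accelerator.state.deepspeed_plugin.deepspeed_config as in the patch; fill_micro_batch_size is a hypothetical helper, not a trl or accelerate function.

```python
import json
from pathlib import Path

KEY = "train_micro_batch_size_per_gpu"


def fill_micro_batch_size(live_config: dict, file_config: dict) -> dict:
    """Copy an integer micro batch size from the file config into the live config.

    The live value is only replaced when it is missing or not an integer (e.g. "auto").
    """
    live_value = live_config.get(KEY)
    file_value = file_config.get(KEY)
    live_ok = isinstance(live_value, int) and not isinstance(live_value, bool)
    file_ok = isinstance(file_value, int) and not isinstance(file_value, bool)
    if not live_ok and file_ok:
        live_config[KEY] = file_value
    return live_config


# Inside the modified DDPOTrainer.__init__, the call could look like this
# (assumption: the path is the one given to --deepspeed):
#   file_config = json.loads(Path("./deepspeed_config_zero2.json").read_text())
#   plugin = self.accelerator.state.deepspeed_plugin
#   if plugin is not None:
#       fill_micro_batch_size(plugin.deepspeed_config, file_config)

print(fill_micro_batch_size({KEY: "auto"}, {KEY: 4}))  # {'train_micro_batch_size_per_gpu': 4}
```

An integer that is already present in the live config is left alone, so the helper only steps in when the value would otherwise trigger the ValueError.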