microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] train_batch_size error caused by world size mismatch #4703

Closed nathan-az closed 11 months ago

nathan-az commented 11 months ago

I am using the deepspeed launcher with the HuggingFace Trainer in my script. My script makes zero mention of deepspeed or accelerate. My understanding was that the Trainer takes care of this.

I am running my job using deepspeed --hostfile ... scripts/run_sft.py <script_args>. I'm running on a 3-node cluster with passwordless SSH, each node with 4 GPUs. The deepspeed config is passed via --deepspeed= and processed by the HfArgumentParser along with the rest of the training args, which then pass the deepspeed config into the Trainer.
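
For reference, this is roughly how I understand the wiring (a hedged sketch, not my actual run_sft.py; the SFTTrainer simply receives the parsed TrainingArguments):

    # Rough sketch, not the actual script: the deepspeed config path is just a
    # field on TrainingArguments, so HfArgumentParser picks it up from --deepspeed=
    # and the Trainer/SFTTrainer initializes DeepSpeed from it internally.
    from transformers import HfArgumentParser, TrainingArguments

    parser = HfArgumentParser(TrainingArguments)
    (training_args,) = parser.parse_args_into_dataclasses()
    print(training_args.deepspeed)    # path given via --deepspeed=...
    print(training_args.world_size)   # what transformers believes the world size is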

The gist of my issue is this:

  File "/home/ubuntu/training_run/dl-training-scripts/scripts/run_sft.py", line 208, in main
    trainer = SFTTrainer(
              ^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 149, in __init__
    model = AutoModelForCausalLM.from_pretrained(model, **model_init_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3228, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 848, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 776, in __init__
    self._configure_train_batch_size()
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 954, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 902, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 192 != 8 * 2 * 1

Note how the batch size set by transformers is 192 == 8 (per-device batch size) * 2 (gradient accumulation steps) * 12 (args.world_size), but the deepspeed.runtime.config object has world_size 1.

It seems that train_batch_size is set by the transformers integration using args.world_size. This is causing the deepspeed assertion to fail.

After some painstaking debugging, I think the world size inside deepspeed.runtime.config is being set in the except block in deepspeed: the dist variable is None, so world_size falls back to 1, causing the mismatch. I suspect the cdb isn't set yet (it's NoneType), which is what triggers the exception. I don't know whether this is correct and the world size should stay as 1, or whether the problem is in transformers and args.world_size (set by the deepspeed launcher, I believe) should not be respected in this context.
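
To make the mismatch concrete, here is the failing check with my run's values plugged in (the per-device batch size and gradient accumulation steps come from my training args; the 12 is 3 nodes x 4 GPUs):

    # Values from the failing run above
    micro_batch_per_gpu = 8        # per_device_train_batch_size
    grad_acc_steps = 2             # gradient_accumulation_steps
    transformers_world_size = 12   # args.world_size, set from the launcher
    deepspeed_world_size = 1       # the fallback when dist/cdb is still None

    train_batch_size = micro_batch_per_gpu * grad_acc_steps * transformers_world_size  # 192
    # DeepSpeed's batch assertion then checks against its own world size:
    # 192 != 8 * 2 * 1, so the AssertionError in the traceback is raised.
    assert train_batch_size == micro_batch_per_gpu * grad_acc_steps * deepspeed_world_size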

My deepspeed config does not specify any batch sizes, letting the Trainer handle these to avoid conflict:

    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",

ds_report output:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/torch']
torch version .................... 2.1.1+cu121
deepspeed install path ........... ['/home/ubuntu/miniconda3/envs/training_env/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.2, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 93.35 GB

System info (please complete the following information):

Launcher context: deepspeed launcher with hostfile. The launcher appears to be launching and distributing successfully:


[2023-11-18 04:27:25,908] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=4, node_rank=0
[2023-11-18 04:27:25,908] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'<network_prefix>.7': [0, 1, 2, 3], '<network_prefix>.11': [4, 5, 6, 7], '<network_prefix>.60': [8, 9, 10, 11]})
nathan-az commented 11 months ago

Closing this. Frankly, I have no idea what I did to fix it; I think the problem was on the transformers/accelerate side.

ziyi-yang commented 9 months ago

Hi Nathan, may I know how you fixed this issue on your side eventually? Got the same bug here... thanks!

nathan-az commented 9 months ago

Hey @ziyi-yang , unfortunately I never worked out what the issue was.

I swapped to using accelerate to launch my training job, while using deepspeed with the pdsh launcher, which should work very similarly (if not identically).

It's possible that I was incorrectly launching the job, and accelerate abstracted away something I was doing wrong. If you're interested in trying to use accelerate to launch the job, the relevant config snippet that I'm using is:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: pdsh
  deepspeed_hostfile: {ds_hostfile_path}
  deepspeed_config_file: {ds_config_path}
  zero3_init_flag: true
distributed_type: DEEPSPEED

Note that in my case, since I'm using PDSH, I only run the launch command on the main node. Let me know if you have any specific questions about my setup, and sorry I can't be of more help!
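
For completeness, the launch itself is then just something along these lines (assuming the YAML above is saved as, say, accelerate_config.yaml, and run only from the main node):

    accelerate launch --config_file accelerate_config.yaml scripts/run_sft.py <script_args>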

ziyi-yang commented 9 months ago

Not at all, really appreciate your quick and detailed reply.

Elenore1997 commented 7 months ago

Hi, I also encountered this batch size issue when using DeepSpeed ZeRO-3 to launch the job, while ZeRO-2 works fine. Could you please share the config file and the launch script using accelerate? Thanks in advance!

nathan-az commented 7 months ago

Hey @Elenore1997 , I don't have my versions anymore.

However the following links may be useful for you, especially if you're interested in training qlora with limited hardware: deepspeed, fsdp.

Hope those are helpful - they both provide example scripts with accelerate configs. If you need a more customised deepspeed setup you can also choose to link a deepspeed json config in your accelerate yaml, rather than using the limited options available directly on the accelerate yaml.

Note that I saw slightly lower memory usage using FSDP rather than deepspeed, and was able to finetune a 70B qlora on 4x 24GB GPUs without any CPU offloading (although surprisingly, enabling CPU offloading yielded faster training).

Sorry I can't provide my specific examples, but let me know if you have further questions.

Elenore1997 commented 7 months ago

It's nice of you! Thanks for your quick reply, I will try the links you provided.