microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.47k stars 4.12k forks

[BUG] Config mesh_device None #6501

Open GuanhuaWang opened 2 months ago

GuanhuaWang commented 2 months ago

I am using DeepSpeed 0.15.1 on two A6000 GPUs, following the Hugging Face non-Trainer DeepSpeed integration, and I get the following assertion error:

guanhua@guanhua-Lambda:~/DiscQuant$ deepspeed test_hf_ds.py
[2024-09-06 15:53:29,210] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:29,660] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:30,664] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-06 15:53:30,664] [INFO] [runner.py:585:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test_hf_ds.py
[2024-09-06 15:53:32,031] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:32,476] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:33,468] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-06 15:53:33,468] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-06 15:53:33,468] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-06 15:53:33,468] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-06 15:53:33,468] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-06 15:53:33,469] [INFO] [launch.py:256:main] process 513898 spawned with command: ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=0']
[2024-09-06 15:53:33,469] [INFO] [launch.py:256:main] process 513899 spawned with command: ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=1']
[2024-09-06 15:53:34,951] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:34,990] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:35,366] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:35,401] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:54:00,929] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2024-09-06 15:54:00,930] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
Traceback (most recent call last):
  File "/home/guanhua/DiscQuant/test_hf_ds.py", line 47, in <module>
    model = AutoModel.from_pretrained("openai-community/gpt2")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3821, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 933, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 798, in __init__
    self._configure_train_batch_size()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 981, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 929, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1
Traceback (most recent call last):
  File "/home/guanhua/DiscQuant/test_hf_ds.py", line 47, in <module>
    model = AutoModel.from_pretrained("openai-community/gpt2")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3821, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 933, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 798, in __init__
    self._configure_train_batch_size()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 981, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 929, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1
[2024-09-06 15:54:01,510] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 513898
[2024-09-06 15:54:01,532] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 513899
[2024-09-06 15:54:01,532] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=1'] exits with return code = 1
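
As a side check on the log above: the --world_info value passed to the launcher looks like base64-encoded JSON, and decoding it shows how many ranks the launcher intended to start. The sketch below is only an illustration; the helper names decode_world_info and expected_world_size are hypothetical and not part of DeepSpeed.

```python
import base64
import json

# Value copied from the --world_info argument in the launcher log above.
WORLD_INFO_B64 = "eyJsb2NhbGhvc3QiOiBbMCwgMV19"


def decode_world_info(encoded: str) -> dict:
    """Decode a base64-encoded JSON mapping of hostname -> list of GPU slots.

    Hypothetical helper for illustration; it assumes the encoding is
    URL-safe base64 over UTF-8 JSON, which matches the value in this log.
    """
    return json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))


def expected_world_size(world_info: dict) -> int:
    """Count the ranks the launcher planned to start across all hosts."""
    return sum(len(slots) for slots in world_info.values())


world_info = decode_world_info(WORLD_INFO_B64)
print(world_info, expected_world_size(world_info))
```

The decoded mapping agrees with the "WORLD INFO DICT" and "dist_world_size=2" lines in the log, so the launcher itself planned for two ranks; the world_size = 1 value only shows up later, when the config is built.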

I think the root cause is visible in the log line "[config.py:733:__init__] Config mesh_device None world_size = 1": somehow ds_init did not pass the correct mesh_device argument, which makes world_size 1 (it should be 2).
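
To make the arithmetic behind the assertion concrete, here is a minimal sketch of the identity quoted in the error message, train_batch_size == micro_batch_per_gpu * gradient_acc_step * world_size, evaluated with the batch values from the config in this report. The function names are hypothetical and this is not DeepSpeed's own implementation; it only mirrors the check as the message states it.

```python
def expected_train_batch_size(micro_batch_per_gpu: int,
                              grad_acc_steps: int,
                              world_size: int) -> int:
    """Global batch size implied by the per-GPU settings (illustrative helper)."""
    return micro_batch_per_gpu * grad_acc_steps * world_size


def batch_config_is_consistent(train_batch_size: int,
                               micro_batch_per_gpu: int,
                               grad_acc_steps: int,
                               world_size: int) -> bool:
    """Mirror of the identity quoted in the AssertionError message."""
    return train_batch_size == expected_train_batch_size(
        micro_batch_per_gpu, grad_acc_steps, world_size)


# Batch values taken from the ds_config in this report.
TRAIN_BATCH_SIZE = 2
MICRO_BATCH_PER_GPU = 1
GRAD_ACC_STEPS = 1

# Compare the world size seen in the log (1) with the intended one (2).
for world_size in (1, 2):
    ok = batch_config_is_consistent(
        TRAIN_BATCH_SIZE, MICRO_BATCH_PER_GPU, GRAD_ACC_STEPS, world_size)
    print(f"world_size={world_size}: consistent={ok}")
```

With world_size = 1 the check fails (2 != 1 * 1 * 1), which matches the traceback; with world_size = 2 the same config passes. That is consistent with the batch settings themselves being fine and the world size being wrong at the point where the config is constructed.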

To reproduce, run the Python script below with the command: deepspeed --num_gpus 2 BELOW_PYTHON.py

from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed

ds_config = {
  #"fp16": {
  #  "enabled": "auto",
  #  "loss_scale": 0,
  #  "loss_scale_window": 1000,
  #  "initial_scale_power": 16,
  #  "hysteresis": 2,
  #  "min_loss_scale": 1
  #},
  "bf16": {
    "enabled": "auto"
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": True,
    "contiguous_gradients": True,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": True
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "train_batch_size": 2,
  "train_micro_batch_size_per_gpu": 1,
  "steps_per_print": 1e5,
  "wall_clock_breakdown": False,
  "data_parallel_size": 2
}

ds_cf = HfDeepSpeedConfig(ds_config)
model = AutoModel.from_pretrained("openai-community/gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, dist_init_required=True)

AetherPrior commented 1 week ago

Did you manage to find a fix for this?