I am using DeepSpeed 0.15.1 on two A6000 GPUs, following the Hugging Face Non-Trainer DeepSpeed integration, and I get an assertion error:

guanhua@guanhua-Lambda:~/DiscQuant$ deepspeed test_hf_ds.py
[2024-09-06 15:53:29,210] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:29,660] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:30,664] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-06 15:53:30,664] [INFO] [runner.py:585:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test_hf_ds.py
[2024-09-06 15:53:32,031] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:32,476] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:33,468] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-06 15:53:33,468] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-06 15:53:33,468] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-06 15:53:33,468] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-06 15:53:33,468] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-06 15:53:33,469] [INFO] [launch.py:256:main] process 513898 spawned with command: ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=0']
[2024-09-06 15:53:33,469] [INFO] [launch.py:256:main] process 513899 spawned with command: ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=1']
[2024-09-06 15:53:34,951] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:34,990] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:35,366] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:35,401] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:54:00,929] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2024-09-06 15:54:00,930] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
Traceback (most recent call last):
  File "/home/guanhua/DiscQuant/test_hf_ds.py", line 47, in <module>
    model = AutoModel.from_pretrained("openai-community/gpt2")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3821, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 933, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 798, in __init__
    self._configure_train_batch_size()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 981, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 929, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1

(Rank 1 prints an identical traceback and AssertionError.)
[2024-09-06 15:54:01,510] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 513898
[2024-09-06 15:54:01,532] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 513899
[2024-09-06 15:54:01,532] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=1'] exits with return code = 1
I think the root cause is the line `[config.py:733:__init__] Config mesh_device None world_size = 1`: somehow ds_init did not pass the correct `mesh_device` argument, which makes `world_size=1` when it should be 2. Given the config values in the assertion (`train_batch_size=2`, `micro_batch_per_gpu=1`, `gradient_accumulation_steps=1`), the check `train_batch == micro_batch * grad_acc * world_size` only passes for `world_size=2`, so both ranks hit the assertion.
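For reference, a quick way to see what world size each rank actually gets under the launcher (a minimal sketch, assuming it is run with the same deepspeed command; the file name and print labels are mine):

```python
# check_world_size.py (hypothetical) -- run with: deepspeed --num_gpus 2 check_world_size.py
import os

import torch.distributed as dist
import deepspeed

# Set up torch.distributed from the env vars the deepspeed launcher exports
# (RANK, WORLD_SIZE, MASTER_ADDR, ...).
deepspeed.init_distributed()

print(
    f"rank={dist.get_rank()} "
    f"WORLD_SIZE env={os.environ.get('WORLD_SIZE')} "
    f"dist.get_world_size()={dist.get_world_size()}"
)
```

With `--num_gpus 2` this should report a world size of 2 on both ranks, whereas the DeepSpeedConfig built inside zero.Init is using world_size=1.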
To reproduce, below is the Python script I am using; the command is `deepspeed --num_gpus 2 BELOW_PYTHON.py`.
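(The exact test_hf_ds.py is not reproduced here; the following is a minimal sketch of such a Non-Trainer integration script, with the ds_config values chosen as assumptions to match the numbers in the assertion.)

```python
# BELOW_PYTHON.py (sketch) -- follows the Hugging Face "Non-Trainer DeepSpeed
# integration" pattern; config values are assumptions matching the assertion
# (train_batch_size=2, micro_batch=1, grad_acc=1, i.e. intended for 2 GPUs).
import deepspeed
from transformers import AutoModel
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "train_batch_size": 2,                  # 2 == 1 * 1 * world_size only if world_size == 2
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

# Must be created (and kept alive) before from_pretrained so that transformers
# detects ZeRO-3 and wraps model construction in deepspeed.zero.Init.
dschf = HfDeepSpeedConfig(ds_config)

# This is the call that fails in the traceback above: zero.Init builds a
# DeepSpeedConfig from this config and runs the batch-size assertion, but with
# world_size=1 instead of 2.
model = AutoModel.from_pretrained("openai-community/gpt2")

engine, _, _, _ = deepspeed.initialize(model=model, config_params=ds_config)
```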