huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Bug] Error when trying to run two models on a machine where ZeRO config variables have already been created. #30323

Closed jacklanda closed 4 months ago

jacklanda commented 6 months ago

System Info

Who can help?

@ArthurZucker @younesbelkada @pacman100

Information

Tasks

Reproduction

Error Messages

deepspeed --num_gpus=4 test.py --deepspeed deepspeed_config_zero3_without_offload.json
[2024-04-18 23:48:52,671] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:48:53,410] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-18 23:48:53,440] [INFO] [runner.py:568:main] cmd = /home/ivanfung/miniforge3/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py --deepspeed deepspeed_config_zero3_without_offload.json
[2024-04-18 23:48:55,322] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:48:55,784] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-04-18 23:48:55,784] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-04-18 23:48:55,784] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-04-18 23:48:55,784] [INFO] [launch.py:163:main] dist_world_size=4
[2024-04-18 23:48:55,784] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-04-18 23:48:55,785] [INFO] [launch.py:253:main] process 493390 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=0', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,786] [INFO] [launch.py:253:main] process 493391 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=1', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,787] [INFO] [launch.py:253:main] process 493392 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=2', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,788] [INFO] [launch.py:253:main] process 493393 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 3}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 2}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 0}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 1}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:06<00:00,  3.15s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:06<00:00,  3.19s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:06<00:00,  3.18s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:06<00:00,  3.15s/it]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 21859.95 examples/s]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 18743.57 examples/s]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 19966.79 examples/s]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 22316.04 examples/s]
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 37141.84 examples/s]
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 38661.50 examples/s]
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 35041.42 examples/s]
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13883/13883 [00:00<00:00, 38018.21 examples/s]
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 4166.98 examples/s]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 4280.46 examples/s]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 4181.90 examples/s]
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 5585.31 examples/s]
Map (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 4121.15 examples/s]
Filter (num_proc=32):  88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž         | 1536/1752 [00:00<00:00, 8288.97 examples/s]Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 5868.96 examples/s]
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 5691.31 examples/s]
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Filter (num_proc=32):   0%|                                                                                            | 0/1752 [00:00<?, ? examples/s][2024-04-18 23:49:19,923] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Filter (num_proc=32):  53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹                                     | 935/1752 [00:00<00:00, 4852.83 examples/s][2024-04-18 23:49:20,050] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-18 23:49:20,123] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Filter (num_proc=32): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1752/1752 [00:00<00:00, 5395.09 examples/s]
[2024-04-18 23:49:20,171] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:49:20,251] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-18 23:49:20,252] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-18 23:49:20,303] [INFO] [comm.py:637:init_distributed] cdb=None
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:20 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
[2024-04-18 23:49:21,018] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:49:21,146] [INFO] [comm.py:637:init_distributed] cdb=None
trainer.train
trainer.train
trainer.train
trainer.train
hpZeRO group size: 4
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 4.996, 'grad_norm': 166.72275161005012, 'learning_rate': 3.125e-06, 'epoch': 0.09}
{'loss': 2.4735, 'grad_norm': 38.799148163513834, 'learning_rate': 6.25e-06, 'epoch': 0.18}
{'loss': 2.0208, 'grad_norm': 36.615276885153875, 'learning_rate': 9.375000000000001e-06, 'epoch': 0.28}
{'loss': 1.6564, 'grad_norm': 9.72972914196507, 'learning_rate': 9.844054580896686e-06, 'epoch': 0.37}
{'loss': 1.41, 'grad_norm': 9.066500374092316, 'learning_rate': 9.649122807017545e-06, 'epoch': 0.46}
 10%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                                     | 54/545 [01:18<11:25,  1.40s/itTraceback (most recent call last):β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 55/55 [00:22<00:00,  2.44it/s]
  File "/home/ivanfung/workspace/bug/test.py", line 366, in <module>
    train(args)
  File "/home/ivanfung/workspace/bug/test.py", line 340, in train
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
    output = eval_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/ivanfung/workspace/bug/test.py", line 57, in compute_metrics
    bs_f1 = BERT_SCORER.compute(
  File "/home/ivanfung/workspace/app/evaluate/src/evaluate/module.py", line 462, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/ivanfung/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bertscore/cf4907b18f8f741f202232c0f8009a3bd49ff98802c245abcb6ea51a37a8c05b/bertscore.py", line 189, in _compute
    self.cached_bertscorer = scorer(
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/scorer.py", line 98, in __init__
    self._model = get_model(self.model_type, self.num_layers, self.all_layers)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/utils.py", line 255, in get_model
    model = AutoModel.from_pretrained(model_type)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3394, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 939, in __init__
    groups._create_zero_param_parallel_group(_ds_config.zero_config.zero_hpz_partition_size)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 518, in _create_zero_param_parallel_group
    assert _ZERO_PARAM_INTRA_PARALLEL_GROUP is None, \
AssertionError: ZeRO parameter intra parallel group is already initialized
Traceback (most recent call last):
  File "/home/ivanfung/workspace/bug/test.py", line 366, in <module>
    train(args)
  File "/home/ivanfung/workspace/bug/test.py", line 340, in train
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
    output = eval_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/ivanfung/workspace/bug/test.py", line 57, in compute_metrics
    bs_f1 = BERT_SCORER.compute(
  File "/home/ivanfung/workspace/app/evaluate/src/evaluate/module.py", line 462, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/ivanfung/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bertscore/cf4907b18f8f741f202232c0f8009a3bd49ff98802c245abcb6ea51a37a8c05b/bertscore.py", line 189, in _compute
    self.cached_bertscorer = scorer(
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/scorer.py", line 98, in __init__
    self._model = get_model(self.model_type, self.num_layers, self.all_layers)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/utils.py", line 255, in get_model
    model = AutoModel.from_pretrained(model_type)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3394, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 939, in __init__
    groups._create_zero_param_parallel_group(_ds_config.zero_config.zero_hpz_partition_size)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 518, in _create_zero_param_parallel_group
    assert _ZERO_PARAM_INTRA_PARALLEL_GROUP is None, \
AssertionError: ZeRO parameter intra parallel group is already initialized
Traceback (most recent call last):
  File "/home/ivanfung/workspace/bug/test.py", line 366, in <module>
    train(args)
  File "/home/ivanfung/workspace/bug/test.py", line 340, in train
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
    output = eval_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/ivanfung/workspace/bug/test.py", line 57, in compute_metrics
    bs_f1 = BERT_SCORER.compute(
  File "/home/ivanfung/workspace/app/evaluate/src/evaluate/module.py", line 462, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/ivanfung/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bertscore/cf4907b18f8f741f202232c0f8009a3bd49ff98802c245abcb6ea51a37a8c05b/bertscore.py", line 189, in _compute
    self.cached_bertscorer = scorer(
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/scorer.py", line 98, in __init__
    self._model = get_model(self.model_type, self.num_layers, self.all_layers)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/utils.py", line 255, in get_model
    model = AutoModel.from_pretrained(model_type)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3394, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 939, in __init__
    groups._create_zero_param_parallel_group(_ds_config.zero_config.zero_hpz_partition_size)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 518, in _create_zero_param_parallel_group
    assert _ZERO_PARAM_INTRA_PARALLEL_GROUP is None, \
AssertionError: ZeRO parameter intra parallel group is already initialized
Traceback (most recent call last):
  File "/home/ivanfung/workspace/bug/test.py", line 366, in <module>
    train(args)
  File "/home/ivanfung/workspace/bug/test.py", line 340, in train
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
    output = eval_loop(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/ivanfung/workspace/bug/test.py", line 57, in compute_metrics
    bs_f1 = BERT_SCORER.compute(
  File "/home/ivanfung/workspace/app/evaluate/src/evaluate/module.py", line 462, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/ivanfung/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bertscore/cf4907b18f8f741f202232c0f8009a3bd49ff98802c245abcb6ea51a37a8c05b/bertscore.py", line 189, in _compute
    self.cached_bertscorer = scorer(
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/scorer.py", line 98, in __init__
    self._model = get_model(self.model_type, self.num_layers, self.all_layers)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/utils.py", line 255, in get_model
    model = AutoModel.from_pretrained(model_type)
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3394, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 939, in __init__
    groups._create_zero_param_parallel_group(_ds_config.zero_config.zero_hpz_partition_size)
  File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 518, in _create_zero_param_parallel_group
    assert _ZERO_PARAM_INTRA_PARALLEL_GROUP is None, \
AssertionError: ZeRO parameter intra parallel group is already initialized
[2024-04-18 23:51:25,958] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493390
[2024-04-18 23:51:26,535] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493391
[2024-04-18 23:51:26,595] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493392
[2024-04-18 23:51:26,595] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493393
[2024-04-18 23:51:26,783] [ERROR] [launch.py:322:sigkill_handler] ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_zero3_without_offload.json'] exits with return code = 1

Dataset

I used the data examples attached below for training and validation when reproducing this error. Please download the archive and decompress it with `unzip dataset.zip`: dataset.zip. The expected record format is sketched right after this paragraph.
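
For reference, `generate_and_tokenize_prompt` in test.py reads the `instruction` and `definition` fields from each record, so I assume train.jsonl and valid.jsonl look roughly like the toy sketch below (the field values here are invented for illustration, not taken from the real dataset):

```python
# make_toy_dataset.py -- illustrative only; the real data is in dataset.zip.
# Assumes each JSONL record carries the "instruction" and "definition" fields
# that generate_and_tokenize_prompt() in test.py reads.
import json

toy_records = [
    {"instruction": "Explain the term 'overfitting'.",
     "definition": "Overfitting is when a model memorizes training data and fails to generalize."},
    {"instruction": "Explain the term 'learning rate'.",
     "definition": "The learning rate controls the step size of each optimizer update."},
]

for path in ("train.jsonl", "valid.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for record in toy_records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
print("Wrote toy train.jsonl and valid.jsonl")
```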

Steps for Reproduction

  1. Create the DeepSpeed config deepspeed_config_zero3_without_offload.json as follows:

```json
{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "zero_hpz_partition_size": 8,
        "reduce_bucket_size": 10000000,
        "reduce_scatter": true,
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
  2. Use DeepSpeed to run a training script test.py that loads models via Hugging Face calls such as AutoModel.from_pretrained(...):

```python
# -*- coding: utf-8 -*-

import os
import sys
import json
import glob
import logging
import argparse
import warnings
from typing import List, Dict, Optional

import torch
import transformers
from evaluate import load
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.tokenization_utils_base import BatchEncoding

warnings.filterwarnings("ignore")

TOKENIZER = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
TOKENIZER.pad_token_id = 0
TOKENIZER.bos_token_id = 1
TOKENIZER.eos_token_id = 2

BERT_SCORER = load("bertscore")


def preprocess_logits_for_metrics(logits, labels):
    """
    The original Trainer may cause an OOM issue.
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels


def compute_metrics(eval_preds):
    """Compute metrics for evaluation."""
    pred_ids = eval_preds.predictions[0]
    labels_ids = eval_preds.label_ids
    if isinstance(pred_ids, tuple):
        pred_ids = pred_ids[0]

    pred_ids[pred_ids == -100] = TOKENIZER.pad_token_id
    pred_str = TOKENIZER.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = TOKENIZER.pad_token_id
    label_str = TOKENIZER.batch_decode(labels_ids, skip_special_tokens=True)

    # compute BERTScore F1
    bs_f1 = BERT_SCORER.compute(
        predictions=pred_str,
        references=label_str,
        lang="en",
        nthreads=16,
        device="cuda:3",
    )["f1"][0]

    # return {"rouge-l": round(rouge_l, 4) * 100}
    return {
        "bertscore-f1": round(bs_f1, 4) * 100,
    }


def get_logger(logger_name: str, output_dir: str) -> logging.Logger:
    """Initialize logger."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    os.makedirs(output_dir, exist_ok=True)
    file_handler = logging.FileHandler(
        os.path.join(output_dir, "log.txt"), mode="w")
    file_handler.setLevel(logging.INFO)
    file_handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(file_handler)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(console_handler)

    return logger


def train(args: argparse.Namespace) -> None:
    """Training entry for supervised fine-tuning."""
    model_config = {
        "batch_size": 128,
        "num_epochs": 5,
        "per_device_train_batch_size": 32,
        "eval_times": 10,
        "warmup_rate": 0.06,
        "gradient_accumulation_steps": 1,
    }
    model_type = "llama"
    model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"
    data_path_train = "./train.jsonl"
    data_path_valid = "./valid.jsonl"
    output_dir = "./output"
    max_seq_len = 128

    logger = get_logger("train", "output")
    logger.info("args.__dict__ : {}".format(args.__dict__))

    assert (
        model_name_or_path
    ), "Please specify a --base_model, e.g. --base_model='decapoda-research/llama-7b-hf'"

    gradient_accumulation_steps = (
        model_config["batch_size"] // model_config["per_device_train_batch_size"]
        if "gradient_accumulation_steps" not in model_config
        else model_config["gradient_accumulation_steps"]
    )

    logger.info(
        "per_device_train_batch_size = {}, gradient_accumulation_steps = {}".format(
            model_config["per_device_train_batch_size"], gradient_accumulation_steps
        )
    )
    device_map = None
    world_size = int(
        os.environ.get("WORLD_SIZE", 1)
    )  # `world_size` corresponds to the number of GPUs
    ddp = world_size != 1
    if ddp:
        device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
        gradient_accumulation_steps = max(
            gradient_accumulation_steps // world_size, 1)

    # load model and tokenizer for LLaMA and its variants
    model = LlamaForCausalLM.from_pretrained(
        model_name_or_path,
        device_map=device_map,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path)
    tokenizer.pad_token_id = 0
    tokenizer.bos_token_id = 1
    tokenizer.eos_token_id = 2

    def tokenize(
        input_text: str, target_text: str, add_eos_token: bool = True
    ) -> Dict[str, str]:
        """Tokenize the given prompt and convert it to input_ids, attention_mask and label ids."""
        result = dict()
        inputs = tokenizer(
            input_text,
            truncation=False,
            max_length=max_seq_len,
            padding=False,
            return_tensors=None,
        )
        targets = tokenizer(
            target_text,
            truncation=False,
            max_length=max_seq_len,
            padding=False,
            return_tensors=None,
        )
        inputs_len = len(inputs["input_ids"])
        targets_len = len(targets["input_ids"])

        # (1) len of inputs + len of targets < max_seq_len
        if inputs_len + targets_len < max_seq_len:
            result["input_ids"] = inputs["input_ids"] + targets["input_ids"]
            result["attention_mask"] = (
                inputs["attention_mask"] + targets["attention_mask"]
            )
        # (2) len of inputs + len of targets >= max_seq_len, shrink the length of inputs
        elif inputs_len + targets_len >= max_seq_len:
            inputs_len = max_seq_len - targets_len - 1
            result["input_ids"] = (
                inputs["input_ids"][:inputs_len] + targets["input_ids"]
            )
            result["attention_mask"] = (
                inputs["attention_mask"][:inputs_len] +
                targets["attention_mask"]
            )

        if inputs_len <= 8:
            print(
                f"[DROP] `inputs_len` should be greater than 30 in input of data point: {input_text}."
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}

        # Add token "eos"
        if (
            result["input_ids"][-1] != tokenizer.eos_token_id
            and len(result["input_ids"]) < max_seq_len
            and add_eos_token
        ):
            result["input_ids"].append(tokenizer.eos_token_id)
            result["attention_mask"].append(1)

        if add_eos_token and len(result["input_ids"]) >= max_seq_len:
            result["input_ids"][max_seq_len - 1] = tokenizer.eos_token_id
            result["attention_mask"][max_seq_len - 1] = 1

        # Construct labels; ignore loss computation for prompt tokens by assigning them -100
        result["labels"] = [-100] * inputs_len + \
            result["input_ids"][inputs_len:].copy()

        if len(result["input_ids"]) != len(result["labels"]):
            print(
                f"[DROP] Length mismatch between `input_ids` and `labels` in {input_text}!"
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}

        return result

    def generate_and_tokenize_prompt(datapoint) -> Dict[str, str]:
        """Generate and construct a prompt constrained by a fixed window size.
        Dynamically generate the input sequence and target sequence for each training example.
        """
        input_text = (
            datapoint["instruction"] + "\n\n"
        )  # no prompt prefix is used for fine-tuning
        input_text = (
            tokenizer.bos_token + input_text
            if tokenizer.bos_token is not None
            else input_text
        )  # Add the bos token if it exists
        target_text = (
            datapoint["definition"] + tokenizer.eos_token
            if tokenizer.eos_token is not None
            else datapoint["definition"]
        )  # Add the eos token if it exists

        # Check the length of input_text and target_text
        if len(input_text.split()) + len(target_text.split()) <= max_seq_len:
            return tokenize(input_text, target_text)
        else:
            print(
                f"[DROP] Length of `input_text` ⨁ `target_text` should be less than {max_seq_len} in data point: {input_text}."
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}

    data_train = load_dataset("json", data_files=data_path_train)["train"]
    training_nums = len(data_train)

    # tokenize datapoints for the training set
    train_data = (
        data_train.shuffle()
        .map(generate_and_tokenize_prompt, num_proc=32, keep_in_memory=True)
        .filter(lambda x: len(x["input_ids"]) > 0, num_proc=32, keep_in_memory=True)
    )
    print(
        f"Disagreement of input vs. target of training data: {str(len([len(d['input_ids']) != len(d['labels']) for d in train_data]))}"
    )
    logger.info("Tokenizing training set success!")
    if os.path.isfile(data_path_valid):
        data_valid = load_dataset("json", data_files=data_path_valid)["train"]
        # tokenize datapoints for the validation set
        val_data = (
            data_valid.shuffle()
            .map(generate_and_tokenize_prompt, num_proc=32, keep_in_memory=True)
            .filter(lambda x: len(x["input_ids"]) > 0, num_proc=32, keep_in_memory=True)
        )
    else:
        val_data = None
    print(
        f"Disagreement of input vs target of valid data: {str(len([len(d['input_ids']) != len(d['labels']) for d in val_data]))}"
    )

    print("***** Start Training *****")
    num_gpus = torch.cuda.device_count()
    total_steps = (
        training_nums // (gradient_accumulation_steps *
                          model_config["per_device_train_batch_size"] * num_gpus) + 1
    ) * model_config["num_epochs"]
    eval_interval_steps = save_interval_steps = total_steps // model_config["eval_times"]
    warmup_steps = int(total_steps * model_config.get("warmup_rate", 0.06))
    logger.info(
        "num_gpus = {}, training_nums = {}, total_steps = {}, warmup_steps = {}".format(
            num_gpus, training_nums, total_steps, warmup_steps
        )
    )
    trainer = transformers.Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        compute_metrics=compute_metrics,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=model_config["per_device_train_batch_size"],
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            adam_beta1=0.9,
            adam_beta2=0.95,
            weight_decay=0.01,
            num_train_epochs=model_config["num_epochs"],
            learning_rate=1e-5,
            lr_scheduler_type="linear",
            bf16=True,
            tf32=True,
            gradient_checkpointing=True,
            logging_dir="logs/tensorboard",
            logging_steps=10,
            evaluation_strategy="steps",
            eval_steps=eval_interval_steps,
            save_steps=eval_interval_steps,
            output_dir=output_dir,
            report_to=None,
            save_total_limit=2,
            load_best_model_at_end=True,
            ddp_find_unused_parameters=False if ddp else None,
            deepspeed=(
                args.deepspeed if args.deepspeed else None
            ),
            group_by_length=True,
        ),
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer,
            pad_to_multiple_of=8,
            return_tensors="pt",
            padding=True,
        ),
    )

    model.config.use_cache = False

    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)
    print("trainer.train")
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
    logger.info("***** Checkpointing *****")

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    # save the tokenizer for each detected checkpoint directory in output_dir
    for checkpoint_dir in glob.glob(os.path.join(output_dir, "checkpoint-*")):
        try:
            tokenizer.save_pretrained(checkpoint_dir)
        except:
            pass

    logger.info("Training succeeded")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--deepspeed", type=str, help="deepspeed config")
    parser.add_argument(
        "--resume_from_checkpoint",
        action="store_true",
        help="either training checkpoint or final adapter",
    )
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()

    train(args)
```

  3. Run training with DeepSpeed (a small sanity-check sketch for the config and launch size follows this list):

```bash
deepspeed --num_gpus=4 test.py --deepspeed deepspeed_config_zero3_without_offload.json
```
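
As a side note, the log above reports `hpZeRO group size: 4` even though the config requests `zero_hpz_partition_size: 8`, presumably because only 4 GPUs are launched. The snippet below is a small, hypothetical sanity check (not part of the repro) that compares the configured hierarchical-partition size against the launch size; the assumption that the partition size should divide the number of launched GPUs is my reading of the hpZeRO docs, so treat it as unverified.

```python
# check_hpz.py -- hypothetical helper, not part of the original reproduction.
# Compares zero_hpz_partition_size in the DeepSpeed config with the number of
# GPUs passed to the launcher (assumption: the partition size should divide it).
import json
import sys

CONFIG_PATH = "deepspeed_config_zero3_without_offload.json"
NUM_GPUS = 4  # matches --num_gpus=4 in the launch command above

with open(CONFIG_PATH) as f:
    cfg = json.load(f)

hpz = cfg.get("zero_optimization", {}).get("zero_hpz_partition_size", 1)
print(f"zero_hpz_partition_size = {hpz}, num_gpus = {NUM_GPUS}")

if hpz > 1 and NUM_GPUS % hpz != 0:
    sys.exit(
        f"zero_hpz_partition_size={hpz} does not divide num_gpus={NUM_GPUS}; "
        "the log above suggests DeepSpeed falls back to a group size of 4."
    )
print("hpZeRO partition size is consistent with the launch size.")
```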

Expected behavior

It should be possible to compute BERTScore on a GPU during the periodic evaluation steps of training, rather than hitting the assertion error above.
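
From the traceback, the failure is that compute_metrics loads a second model via AutoModel.from_pretrained while the ZeRO-3 DeepSpeed config is active, so from_pretrained enters a deepspeed.zero.Init context and DeepSpeed asserts that the hpZeRO intra-parallel group already exists. As a stopgap I am considering the sketch below (untested; it assumes the eager load happens before TrainingArguments registers the DeepSpeed config and therefore stays outside any zero.Init context): instantiate bert_score.BERTScorer at import time and call it directly in compute_metrics instead of going through evaluate's lazy bertscore module.

```python
# Untested workaround sketch (my assumption, not a confirmed fix): build the
# BERTScore model eagerly at module import time, before TrainingArguments
# registers the DeepSpeed ZeRO-3 config, so its internal from_pretrained call
# is not wrapped in deepspeed.zero.Init later during evaluation.
from bert_score import BERTScorer

# Loaded once per process at import time, i.e. before train() builds the Trainer.
EAGER_BERT_SCORER = BERTScorer(lang="en", device="cuda:3")


def bertscore_f1(pred_str, label_str):
    """Mean BERTScore F1 over decoded predictions vs. references."""
    _, _, f1 = EAGER_BERT_SCORER.score(pred_str, label_str)
    return f1.mean().item()
```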

jacklanda commented 6 months ago

Any thoughts on this?

@pacman100

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

jacklanda commented 5 months ago

🤔

ArthurZucker commented 5 months ago

Sorry @jacklanda, I think @muellerzr and @SunMarc will be replacing @pacman100 on such issues! Could one of you have a look?

jacklanda commented 4 months ago

Are there any thoughts on it?

muellerzr commented 4 months ago

At this time we do not support multiple models with deepspeed, please see: https://github.com/huggingface/accelerate/issues/2496

jacklanda commented 4 months ago

> At this time we do not support multiple models with deepspeed, please see: huggingface/accelerate#2496

I see. Thanks for your message :)