switiz closed this issue 1 year ago.
Did you specify a shared memory size when running Docker?
Please also share the error message.
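For reference, a minimal sketch of how a larger `/dev/shm` can be passed to the container at launch (the image tag matches the one mentioned in this thread; other flags are placeholders and may need adjusting for your setup):

```shell
# Launch the container with an enlarged /dev/shm so NCCL's
# shared-memory transport does not run out of space.
docker run --gpus all --shm-size=16g \
    -it nvcr.io/nvidia/pytorch:21.07-py3 bash

# Inside the container, verify the mounted size:
df -h /dev/shm
```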
Dear @hyunwoongko
shm 16G 0 16G 0% /dev/shm
Do I need to enable any special logging for this issue? No error log is printed; the process just hangs, so only the general logs are attached. If a special log is needed, please let me know and I will test it.
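Since no error is printed before the hang, one assumption-laden sketch for getting more detail is NCCL's own debug logging, enabled via environment variables before launching (the launch command mirrors the one visible in the log below; adjust as needed):

```shell
# Ask NCCL to print initialization and transport details, which
# usually reveals which transport (SHM, P2P, NET) is stalling.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

deepspeed --num_gpus 2 gpt_neo_deepspeed.py
```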
I changed the Docker image from 21.06 to 21.07 (which includes the latest NCCL version), but the issue still reproduces.
[0] NVIDIA A100-SXM4-40GB | 42'C, 100 % | 15490 / 40536 MB | [1] NVIDIA A100-SXM4-40GB | 39'C, 100 % | 15490 / 40536 MB |
[2021-08-03 01:47:32,928] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-08-03 01:47:36,323] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 gpt_neo_deepspeed.py
[2021-08-03 01:47:37,184] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.9.9
[2021-08-03 01:47:37,184] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2021-08-03 01:47:37,184] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=2, node_rank=0
[2021-08-03 01:47:37,184] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2021-08-03 01:47:37,184] [INFO] [launch.py:102:main] dist_world_size=2
[2021-08-03 01:47:37,184] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2021-08-03 01:52:32,666] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-08-03 01:52:32,666] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-08-03 01:52:35,183] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-08-03 01:52:35,184] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.21926426887512207 seconds
Time to load transformer_inference op: 0.21379852294921875 seconds
DeepSpeed Transformer Inference config is {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
[... the same config line is printed for layer_id 0 through 31 on both ranks, identical except that 'local_attention' alternates: False on even layers, True on odd layers ...]
[{'generated_text': 'a-'}]
##### max_len: 3
[{'generated_text': "a's testimony"}]
[{'generated_text': 'a. Field of'}]
[{'generated_text': 'a, or other\n'}]
[{'generated_text': 'a_n)$ and'}]
[{'generated_text': 'a, y técnic'}]
[{'generated_text': 'a}^{\pm}{a'}]
[{'generated_text': 'a\nl\nc\nu\nl'}]
[{'generated_text': 'a\nu\np\nu\np\n'}]
[{'generated_text': 'a{1},1)$ and $b_{'}]
[{'generated_text': 'a. I saw a girl in a bar, she was'}]
[{'generated_text': 'a>\n '}]
[{'generated_text': 'a\n,\n \n-\n3\n*\na\n '}]
[{'generated_text': 'a, the other members of the group are:\n\nA group formed by'}]
[{'generated_text': 'a and the other with two more, and one with one more. The average number'}]
[{'generated_text': 'a). While this result is based on the assumption that the electron and positron in the'}]
[{'generated_text': 'a,b{ref-type="fig"}).\n\nMice'}]
[{'generated_text': 'a-z0-9]{0,1}\.\d{1,2}'}]
[{'generated_text': 'a\nt\ni\nv\ne\n \no\nf\n \nb\n('}]
[{'generated_text': 'a1: "Ia1" "a2: "A2",\n a'}]
[{'generated_text': 'a la situacija vladajuće su pozornici. Opozvao je'}]
[{'generated_text': 'a,b)=(v_1-v_2){\partial_z}\theta,\quad{\partial'}]
[{'generated_text': 'a\n \nm\nu\nl\nt\ni\np\nl\ne\n \no\nf'}]
[{'generated_text': "a\nt\n \ni\ns\n \nt\nh\ne\n \nd\n'\nt\n"}]
[{'generated_text': 'a, 0x01, 0x0d, 0x43, 0xf1, 0x0e, 0x7'}]
[{'generated_text': 'a ao longo dos anos, o alargamento da UE conta com uma forte política de'}]
[{'generated_text': 'a(2/17))(-5/14))*(-12/7)/((aa**6/a)/('}]
[{'generated_text': 'a}{\alpha,\gamma}/p^{a}_{\alpha,\gamma}$ are also denoted by ${'}]
[{'generated_text': 'a7c5c5e_1\n- :distance: 325\n :file: de5e6615cf645fa9ed6'}]
[{'generated_text': 'a, who has the same type of relationship to him?"\n\n"Oh, he knows of such affairs of friendship; has written to him about them;'}]
[{'generated_text': 'a tome 5, vol. ix. p.\xa01013–1016 (cfr. 2.3.8)\n\n[25] Cf'}]
##### max_len: 34
[{'generated_text': 'a a mãi mãi, \ntá na minha cabeça \ntão muito legal! \nE tão legal'}]
[{'generated_text': 'a\'s side." "I believe this is your car, sir?" "I\'m terribly sorry, Miss Karras." "He didn\'t mean to take it." "'}]
[{'generated_text': 'a and c are also known as s , and k is also known as s.\n\nA common name for t is _'}]
[{'generated_text': 'a, Figure 2{ref-type="fig"}. Figure 1.The time course of the experimental protocol. Experimental sessions (n = 5) were performed'}]
[{'generated_text': 'a) (West 2004); Aufricht, 230 F.3d at 1265-66; United States v. Taveras, 156 F.3d 1234,'}]
[{'generated_text': 'a\nl\nu\ne\n?\n \n \n(\na\n)\n \n2\n1\n/\n1\n2\n8\n \n \n('}]
[{'generated_text': 'a2/5]{} (8,-4.5); (0,0); (12,-2.5); (0,2); (2,1.5); (8'}]
[{'generated_text': 'a-d]{.smallcaps}-Glycine, respectively. c~1~ in parentheses indicates the mass-balance constant related to chiral symmetry breaking and is a measure'}]
[{'generated_text': 'a\nt\n \ni\ns\n \nt\nh\ne\n \nr\ne\nm\na\ni\nn\nd\ne\nr\n \nw\n'}]
[{'generated_text': 'a\nl\nl\ne\ns\nt\n \nv\na\nl\nu\ne\n?\n \n \n(\na\n)\n \n-\n4\n7'}]
[{'generated_text': 'a)(4)(B)(ii)-(iii); and (e)(3)(B)-(C) of this section applies if the total amount of principal due under any of the terms is due to an employer whose'}]
[{'generated_text': 'a}\n=====================================\n\nThe primary data used in this analysis are from the 2007-08 Canadian Tobacco Use Study(CTUS07) administered to Canadian youth aged 12-17 years. The CTUS2007 study'}]
[{'generated_text': 'a; @shiraishi_2009; @fukuda_2009; @fukuda_2011; @fukuda_2010; @goto_2009; @goto_2010]. The $^{6'}]
[{'generated_text': 'a-zA-Z]{3}\d|[^A-Za-zA-Z\d])|1[8-\d$]{}",\n "REGEX_'}]
[{'generated_text': 'a}(t,t-t_1)U^F\n(t-t_1)~.\n\label{eq:Lanadu2}$$ and Eq.\xa0(\[eq:Lan'}]
[{'generated_text': 'a, 0xa4, 0x62, 0xc1,\n\t0x4a, 0x8b, 0x5a, 0x9a, 0x7a, 0xc4, 0xf8,'}]
[{'generated_text': 'a,t}^{a,t}({\mathcal{U}})+\mathcal{K}.$$ Then $$J(\mathbf{\tilde{u}})\leq \int{{\mathbb{R}^d'}]
[{'generated_text': 'a\n)\n \n1\n \n \n(\nb\n)\n \nv\n \n \n(\nc\n)\n \n-\n5\n\n\nc\n\n\nL\ne\nt\n \nx'}]
[{'generated_text': 'a-5p), which can be directly correlated with the expression levels (R^2^\u2009=\u20090.8023 with p-value\u2009\<\u20090.0001, see Fig.\xa0[2a]('}]
[{'generated_text': 'a\times b)}\le\frac{c_b}{C_b}+\frac{1}{(2{\varepsilon})^{1/4}},\end{aligned}$$ where $C_a$ and $C'}]
[{'generated_text': 'a7d2e4b2e5\n/Users/me/Projects/x1/x2/x3/x4/x5/x6/x7/x8/x9/x10/x11/x12'}]
[{'generated_text': 'a3, 0x0\n#define ixPHY_SPEC_CAP_STATUS1_DELAY_TH_A4 '}]
[{'generated_text': 'a.\n-1\nLet q be 3/2 + 12 + -7. Suppose -ql + 28 = 5. Suppose 3u - 23 + l = 0. Solve 2z - 2g - 26 = -6*g,'}]
[{'generated_text': 'a{ref-type="fig"}. The mean age decreased significantly in both sexes from 24.75--23.75\u2009±\u20096.56 to 24.43--21.00\u2009±\u20095.38 (t'}]
[{'generated_text': "a.\n\n3. I have never been in your dream! Now I get to play the part of a bad guy who gives you advice that you've never told anyone before! Here's a tip – just tell your friends who you are. Then when they find out they will know"}]
[{'generated_text': 'a>\n
I can't find any errors in the log. Is the program just deadlocked and stuck?
Yes, it is just deadlocked and not responding.
Dear DeepSpeed team,
Do you have any idea about this issue? Is there a specific log I should enable? Does it work normally on your side?
Thank you.
Hi @switiz
Sorry for the late reply. I will investigate this and let you know how to solve it. Thanks, Reza
Okay, now I can repro this at seq-len 140, a bit farther than what you saw! I will look deeper into this and hopefully have a fix soon.
Thanks, Reza
Good news! I will wait for the fix, and once it lands I will re-run the test.
Thanks.
Hi @switiz
Can you please try this branch and see if it solves the issue? By the way, I have changed your script a bit:
import os
import deepspeed
import torch
import transformers
from transformers import pipeline, AutoTokenizer


def init():
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    generator = pipeline(
        'text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.float,
                                               replace_method='auto')
    return generator


def predict(text, max_len):
    # `generator` is the module-level variable assigned under __main__.
    # The barrier keeps both model-parallel ranks in lockstep per call.
    torch.distributed.barrier()
    with torch.no_grad():
        string = generator(text, do_sample=True,
                           min_length=max_len,
                           max_length=max_len,
                           top_k=50,
                           temperature=1.0,
                           top_p=1.0,
                           num_return_sequences=1,
                           pad_token_id=3)
    return string


if __name__ == '__main__':
    generator = init()
    text = 'a'
    seq = 2023
    for i in range(145, seq):
        string = predict(text, i)
        torch.distributed.barrier()
        print(f'[{torch.distributed.get_rank()}] ##### max_len {i} : {string}')
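For reference, a script like this is run through the DeepSpeed launcher so that LOCAL_RANK and WORLD_SIZE are set per process; a hypothetical two-GPU invocation (matching the WORLD_INFO in the log above, and assuming the file is saved as gpt_neo_deepspeed.py) would look like:

```shell
# Launch two local ranks; the launcher sets CUDA_VISIBLE_DEVICES,
# LOCAL_RANK and WORLD_SIZE for each process.
deepspeed --num_gpus 2 gpt_neo_deepspeed.py
```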
Thanks, Reza
Hi @RezaYazdaniAminabadi
Of course. I will let you know the results after testing.
Thanks
Hi @RezaYazdaniAminabadi
I tried to reproduce it 10 times ([increasing tokens from 145 to 2022] * 10) with your fixed DeepSpeed repo (0.4.6+5038b07, 5038b07, reyazda/mp_inference).
The issue no longer reproduces.
There is a slight difference in inference speed with and without the barrier() calls in the code, but the difference is up to xxx ms in long-sequence generation, so it seems to be a minor point.
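As a side note, the per-call overhead comparison above can be measured with a generic timing helper; this is a hedged sketch (`time_call` is a hypothetical helper, and the stand-in workload replaces the real predict() call):

```python
import time

def time_call(fn, *args, repeat=5):
    # Average wall-clock seconds per call of fn(*args) over `repeat` runs.
    start = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    return (time.perf_counter() - start) / repeat

# Stand-in workload; the real comparison would time predict() with and
# without the torch.distributed.barrier() calls and diff the averages.
avg_seconds = time_call(sum, range(1000))
print(avg_seconds >= 0.0)
```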
Thanks.
Closing this issue as @RezaYazdaniAminabadi's branch was merged and seems to have solved the issue.
Description
Dear DeepSpeed team,
I have an issue when using model parallelism (the inference engine): sometimes GPU utilization gets stuck at 100% and the code hangs, so I wrote a test script to exercise the DeepSpeed engine. Here is my test code.
TestCode
DS_Report
ENV
The issue occurs when the input length reaches 90 tokens (the exact point may be randomly determined).
Thank you.