microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] deepspeed inference opt/66b model OOM on 8 GPUs #2933

Open lambda7xx opened 1 year ago

lambda7xx commented 1 year ago

My DeepSpeed version is 0.8.1, transformers is 4.21.2, and I have 8× V100 32GB GPUs on my machine.

Every rank hits the same error while loading the checkpoint shards. One copy of the traceback (it is printed identically by all 8 ranks):

```
Loading 14 checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-ds-inference.py", line 185, in <module>
    model = deepspeed.init_inference(
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in __init__
    self._apply_injection_policy(config)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 358, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 561, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 279, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 273, in load_module_recursive
    load_module_recursive(
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 273, in load_module_recursive
    load_module_recursive(
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 273, in load_module_recursive
    load_module_recursive(
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 271, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 200, in load_transformer_layer
    replace_policy.load_params(module,
AttributeError: 'HFOPTLayerPolicy' object has no attribute 'load_params'
```

Each rank then aborts its NCCL communicator and the launcher kills the subprocesses:

```
PHLRR4036:4558:5025 [4] NCCL INFO [Service thread] Connection closed by localRank 4
PHLRR4036:4558:4558 [4] NCCL INFO comm 0x4a158970 rank 4 nranks 8 cudaDev 4 busId 83000 - Abort COMPLETE
(equivalent "Connection closed" / "Abort COMPLETE" pairs for ranks 0-3 and 5-7)
[2023-03-03 03:41:22,399] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4553
[2023-03-03 03:41:22,433] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4554
[2023-03-03 03:41:23,542] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4555
[2023-03-03 03:41:23,839] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4556
[2023-03-03 03:41:23,842] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4558
[2023-03-03 03:41:23,843] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4560
[2023-03-03 03:41:23,846] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4562
[2023-03-03 03:41:23,848] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4564
[2023-03-03 03:41:23,850] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'bloom-inference-scripts/bloom-ds-inference.py', '--local_rank=7', '--name', 'facebook/opt-66b', '--batch_size', '4', '--tp_size', '4', '--benchmark'] exits with return code = 1
```

lambda7xx commented 1 year ago

Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/lambda7xx/.cache/torch_extensions/py38_cu117/transformer_inference/build.ninja... Building extension module transformer_inference... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module transformer_inference... Loading extension module transformer_inference... Loading extension module transformer_inference... Loading extension module transformer_inference... 
Time to load transformer_inference op: 0.5659897327423096 seconds Time to load transformer_inference op: 0.5149312019348145 seconds Time to load transformer_inference op: 0.5152051448822021 seconds [2023-03-03 03:40:39,841] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 9216, 'intermediate_size': 36864, 'heads': 72, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.ReLU: 2>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False} Time to load transformer_inference op: 0.5122606754302979 seconds Loading extension module transformer_inference... Loading extension module transformer_inference... Time to load transformer_inference op: 0.5149903297424316 seconds Time to load transformer_inference op: 0.5106439590454102 seconds Loading extension module transformer_inference... Time to load transformer_inference op: 0.6110687255859375 seconds Loading extension module transformer_inference... 
Time to load transformer_inference op: 0.6097698211669922 seconds Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.09874415397644043 seconds Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.10198044776916504 seconds Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... 
No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.10840225219726562 seconds Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.11503076553344727 seconds Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.10785055160522461 seconds Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.10958147048950195 seconds Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... Time to load transformer_inference op: 0.10695767402648926 seconds Using /home/lambda7xx/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module transformer_inference, skipping build step... Loading extension module transformer_inference... 
Time to load transformer_inference op: 0.11722230911254883 seconds Traceback (most recent call last): File "bloom-inference-scripts/bloom-ds-inference.py", line 185, in model = deepspeed.init_inference( File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/init.py", line 311, in init_inference engine = InferenceEngine(model, config=ds_inference_config) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in init self._apply_injection_policy(config) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 358, in _apply_injection_policy replace_transformer_layer(client_module, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 532, in replace_transformer_layer replaced_module = replace_module(model=model, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 797, in replace_module replacedmodule, = _replace_module(model, policy) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 824, in _replacemodule , layer_id = _replace_module(child, policies, layer_id=layer_id) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 824, in _replacemodule , layer_id = _replace_module(child, policies, layer_id=layer_id) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 824, in _replacemodule , layer_id = _replace_module(child, policies, layer_id=layer_id) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 814, in _replace_module replaced_module = policies[child.class][0](child, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 522, in replace_fn new_module = replace_with_policy(child, File 
"/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 383, in replace_with_policy _container.create_module() File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/containers/opt.py", line 21, in create_module self.module = DeepSpeedOPTInference(_config, mp_group=self.mp_group) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_opt.py", line 18, in init super().init(config, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 70, in init self.mlp = DeepSpeedMLP(self.config, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_mlp.py", line 45, in init self.output_w = nn.Parameter(torch.empty(intm_size_per_partition, torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 324.00 MiB (GPU 0; 31.75 GiB total capacity; 31.08 GiB already allocated; 216.50 MiB free; 31.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last): File "bloom-inference-scripts/bloom-ds-inference.py", line 185, in model = deepspeed.init_inference( File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/init.py", line 311, in init_inference engine = InferenceEngine(model, config=ds_inference_config) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in init self._apply_injection_policy(config) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 358, in _apply_injection_policy replace_transformer_layer(client_module, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 532, in replace_transformer_layer replaced_module = replace_module(model=model, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 797, in replace_module replacedmodule, = _replace_module(model, policy) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 824, in _replacemodule , layer_id = _replace_module(child, policies, layer_id=layer_id) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 824, in _replacemodule , layer_id = _replace_module(child, policies, layer_id=layer_id) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 824, in _replacemodule , layer_id = _replace_module(child, policies, layer_id=layer_id) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 814, in _replace_module replaced_module = policies[child.class][0](child, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 522, in replace_fn new_module = replace_with_policy(child, File 
"/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 383, in replace_with_policy _container.create_module() File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/module_inject/containers/opt.py", line 21, in create_module self.module = DeepSpeedOPTInference(_config, mp_group=self.mp_group) File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_opt.py", line 18, in init super().init(config, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 70, in init self.mlp = DeepSpeedMLP(self.config, File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_mlp.py", line 45, in init self.output_w = nn.Parameter(torch.empty(intm_size_per_partition, torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 324.00 MiB (GPU 4; 31.75 GiB total capacity; 31.08 GiB already allocated; 216.50 MiB free; 31.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last): File "bloom-inference-scripts/bloom-ds-inference.py", line 185, in Traceback (most recent call last): File "bloom-inference-scripts/bloom-ds-inference.py", line 185, in model = deepspeed.init_inference( File "/home/lambda7xx/.local/lib/python3.8/site-packages/deepspeed/init.py", line 311, in init_inference model = deepspeed.init_inference(
(The same traceback and `torch.cuda.OutOfMemoryError` repeat for the remaining ranks — GPUs 1, 2, 3, 5, 6, and 7 — each reporting 31.75 GiB total capacity with 31.08 GiB already allocated. The run then ends with:)

```
PHLRR4036:3531:3986 [0] NCCL INFO [Service thread] Connection closed by localRank 0
```
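The error message itself suggests trying `max_split_size_mb` to reduce fragmentation. For completeness, a minimal sketch of how that hint can be applied (the value 128 is an arbitrary starting point, not a tuned number; and since every rank is already at ~31 GiB allocated, this alone may not be enough):

```python
import os

# PyTorch reads PYTORCH_CUDA_ALLOC_CONF when the CUDA caching allocator is
# initialized, so this must be set before the first CUDA allocation --
# most simply, before `import torch` in the launch script.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Alternatively, the variable can be exported in the environment before launching the script with `deepspeed`.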


- In fp16, the opt-66b model should take about 132 GB (66B parameters × 2 bytes), and my machine has 256 GB of total GPU memory (8 × 32 GB), so I don't think it should OOM.
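To make the arithmetic explicit, a back-of-envelope check (fp16 weights only; activations, the KV cache, and any temporary buffers created while loading checkpoints are ignored):

```python
# opt-66b has roughly 66e9 parameters; fp16 stores each in 2 bytes.
params = 66e9
bytes_per_param = 2  # fp16

total_gb = params * bytes_per_param / 1e9   # whole model
per_gpu_gb = total_gb / 8                   # sharded across 8 GPUs

print(total_gb)    # 132.0
print(per_gpu_gb)  # 16.5
```

A ~16.5 GB shard should fit comfortably in 32 GB per GPU, yet the traceback shows ~31 GiB already allocated on every rank before the failing 324 MiB allocation, which suggests each rank is materializing far more than its shard during module replacement.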