microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] CUDA error: no kernel image is available for execution on the device #6549

Closed: getao closed this issue 2 weeks ago

getao commented 1 month ago

Related issue: https://github.com/microsoft/DeepSpeed/issues/5724#issuecomment-2330819411. I tried the solution there, but it didn't work in my setup.

Describe the bug

rank1: Traceback (most recent call last):
rank1:   File "my_code.py", line 166, in train_model
rank1:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
rank1:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
rank1:     return inner_training_loop(
rank1:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
rank1:     tr_loss_step = self.training_step(model, inputs)
rank1:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
rank1:     self.accelerator.backward(loss, **kwargs)
rank1:   File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2188, in backward
rank1:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
rank1:   File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
rank1:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2204, in step
rank1:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2110, in _take_model_step
rank1:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1910, in step
rank1:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1816, in _optimizer_step
rank1:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/optimizer.py", line 484, in wrapper
rank1:     out = func(*args, **kwargs)
rank1:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
rank1:     multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
rank1:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
rank1:     return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
rank1: RuntimeError: CUDA error: no kernel image is available for execution on the device
rank1: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

To Reproduce

Steps to reproduce the behavior: train GPT with the Hugging Face Trainer (transformers==4.44.2) inside the official PyTorch 2.4 Docker image (tried both the CUDA 12.4 and CUDA 12.1 variants). I tried DeepSpeed 0.14.5 and 0.15.1; both failed. A minimal sketch of the setup follows.
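For concreteness, here is a minimal sketch of that setup. The model name, config file name, and toy dataset are placeholders rather than details from the original report; launch it with the deepspeed launcher as in the Launcher context below.

```python
# Minimal sketch of the failing setup (placeholders throughout): a GPT model
# trained with the HF Trainer plus a DeepSpeed ZeRO-1/2 config that selects
# FusedAdam. Launch with: deepspeed my_code.py
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Tiny random-token dataset so the script is self-contained."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (32,))
        return {"input_ids": ids, "labels": ids.clone()}

model = AutoModelForCausalLM.from_pretrained("gpt2")
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed="ds_config.json",  # hypothetical ZeRO config using FusedAdam
)
trainer = Trainer(model=model, args=args, train_dataset=ToyDataset())
trainer.train()  # the step() call dies in FusedAdam's multi_tensor_applier
```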

Expected behavior

No error.

ds_report output

[2024-09-18 07:47:07,482] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (override)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (3.0.0), only 2.3.0 and 2.3.1 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.11/site-packages/torch']
torch version .................... 2.4.0
deepspeed install path ........... ['/opt/conda/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.14.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 994.00 GB

System info (please complete the following information):

Launcher context
deepspeed my_code.py

Docker context
https://hub.docker.com/layers/pytorch/pytorch/2.4.0-cuda12.1-cudnn9-devel/images/sha256-a55ff10111eb11f998884327d37361592e632899edd24fce99886b69289e33e6?context=explore

loadams commented 1 month ago

Hi @getao - if you build with `DS_BUILD_FUSED_ADAM=1 pip install deepspeed`, do you get the same error?

getao commented 1 month ago

> Hi @getao - if you build with `DS_BUILD_FUSED_ADAM=1 pip install deepspeed`, do you get the same error?

Yes, I built with DS_BUILD_FUSED_ADAM=1.

BTW, I don't get the error on A100 GPUs; only Hopper GPUs hit it.
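Since "no kernel image is available" usually means the binary was compiled without the GPU's compute capability (sm_90 for Hopper), a quick check is to compare the device's capability against the arch list the installed binaries were built for. This is a hedged diagnostic sketch, not from the original report:

```python
# Diagnostic sketch: is this an architecture mismatch? "no kernel image"
# typically means no code was compiled for this GPU's compute capability.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm_{major}{minor}")  # H100 reports sm_90
print("PyTorch binary supports:", torch.cuda.get_arch_list())

# DeepSpeed's prebuilt/JIT ops honor TORCH_CUDA_ARCH_LIST at build time, so
# one plausible fix (an assumption, not a confirmed solution) is rebuilding
# with Hopper included:
#   TORCH_CUDA_ARCH_LIST="9.0" DS_BUILD_FUSED_ADAM=1 pip install --no-cache-dir deepspeed
```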

SamMicheals commented 1 month ago

I've got the same error on Windows Server 2022. I can also confirm that the same setup works on A100 GPUs but not on H100s. DS_BUILD_FUSED_ADAM=1 didn't work for me either.

ds_report output

[2024-09-27 13:00:07,357] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0927 13:00:10.826000 14536 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_lion ............... [YES] ...... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  please install triton==2.3.0, 2.3.1 or 3.0.0 if you want to use the FP Quantizer Kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [YES] ...... [OKAY]
fused_lion ............. [YES] ...... [OKAY]
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
gds .................... [NO] ....... [NO]
inference_core_ops ..... [YES] ...... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['C:\\Users\\SamDev\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\sam-deepspeed-HJjkcE65-py3.11\\Lib\\site-packages\\torch']
torch version .................... 2.3.1+cu121
deepspeed install path ........... ['C:\\Users\\SamDev\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\sam-deepspeed-HJjkcE65-py3.11\\Lib\\site-packages\\deepspeed']
deepspeed info ................... 0.15.1+10ba3dde, 10ba3dde, HEAD
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... UNKNOWN

System info:

Code

I am using the sample code from the Inference tutorial:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           tensor_parallel={"tp_size": world_size},
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

Log

(sam-deepspeed-py3.11) PS C:\Users\SamDev\Documents\Personable\sam-deepspeed\sam_deepspeed> deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py
[2024-09-27 12:46:24,377] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0927 12:46:30.472000 33576 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-09-27 12:46:32,269] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-27 12:46:32,269] [INFO] [runner.py:585:main] cmd = C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Scripts\python.exe -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None gpt-neo-2.7b-generation.py
[2024-09-27 12:46:34,172] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0927 12:46:37.648000 47128 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-09-27 12:46:38,393] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-27 12:46:38,393] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-27 12:46:38,393] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-27 12:46:38,393] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-27 12:46:38,393] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-27 12:46:38,393] [INFO] [launch.py:256:main] process 39592 spawned with command: ['C:\\Users\\SamDev\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\sam-deepspeed-HJjkcE65-py3.11\\Scripts\\python.exe', '-u', 'gpt-neo-2.7b-generation.py', '--local_rank=0']
[2024-09-27 12:46:38,393] [INFO] [launch.py:256:main] process 47220 spawned with command: ['C:\\Users\\SamDev\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\sam-deepspeed-HJjkcE65-py3.11\\Scripts\\python.exe', '-u', 'gpt-neo-2.7b-generation.py', '--local_rank=1']
[2024-09-27 12:46:40,457] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-27 12:46:40,457] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0927 12:46:44.269000 42844 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
W0927 12:46:44.330000 42532 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
[2024-09-27 12:46:52,173] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.1+10ba3dde, git-hash=10ba3dde, git-branch=HEAD
[2024-09-27 12:46:52,173] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2024-09-27 12:46:52,173] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-27 12:46:52,173] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[2024-09-27 12:46:52,830] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.1+10ba3dde, git-hash=10ba3dde, git-branch=HEAD
[2024-09-27 12:46:52,830] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2024-09-27 12:46:52,830] [INFO] [comm.py:652:init_distributed] cdb=None
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[2024-09-27 12:46:52,972] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000, 'invert_mask': True}
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
------------------------------------------------------
Free memory : 24.792908 (GigaBytes)
Total memory: 79.545776 (GigaBytes)
Requested memory: 0.449219 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 000000162DE00000
------------------------------------------------------
[rank0]: Traceback (most recent call last):
[rank0]:   File "C:\Users\SamDev\Documents\Personable\sam-deepspeed\sam_deepspeed\gpt-neo-2.7b-generation.py", line 18, in <module>
[rank0]:     string = generator("DeepSpeed is", do_sample=True, min_length=50)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\text_generation.py", line 272, in __call__
[rank0]:     return super().__call__(text_inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\base.py", line 1268, in __call__
[rank0]:     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\base.py", line 1275, in run_single
[rank0]:     model_outputs = self.forward(model_inputs, **forward_params)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\base.py", line 1175, in forward
[rank0]:     model_outputs = self._forward(model_inputs, **forward_params)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\text_generation.py", line 370, in _forward
[rank0]:     generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\deepspeed\inference\engine.py", line 639, in _generate
[rank0]:     return self.module.generate(*inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\generation\utils.py", line 2048, in generate
[rank0]:     result = self._sample(
[rank0]:              ^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\generation\utils.py", line 3008, in _sample
[rank0]:     outputs = self(**model_inputs, return_dict=True)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 1059, in forward
[rank0]:     lm_logits = self.lm_head(hidden_states)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
[rank0]:     return F.linear(input, self.weight, self.bias)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank1]: Traceback (most recent call last):
[rank1]:   File "C:\Users\SamDev\Documents\Personable\sam-deepspeed\sam_deepspeed\gpt-neo-2.7b-generation.py", line 18, in <module>
[rank1]:     string = generator("DeepSpeed is", do_sample=True, min_length=50)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\text_generation.py", line 272, in __call__
[rank1]:     return super().__call__(text_inputs, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\base.py", line 1268, in __call__
[rank1]:     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\base.py", line 1275, in run_single
[rank1]:     model_outputs = self.forward(model_inputs, **forward_params)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\base.py", line 1175, in forward
[rank1]:     model_outputs = self._forward(model_inputs, **forward_params)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\pipelines\text_generation.py", line 370, in _forward
[rank1]:     generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
[rank1]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\deepspeed\inference\engine.py", line 639, in _generate
[rank1]:     return self.module.generate(*inputs, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\generation\utils.py", line 2048, in generate
[rank1]:     result = self._sample(
[rank1]:              ^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\generation\utils.py", line 3008, in _sample
[rank1]:     outputs = self(**model_inputs, return_dict=True)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 1059, in forward
[rank1]:     lm_logits = self.lm_head(hidden_states)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "C:\Users\SamDev\AppData\Local\pypoetry\Cache\virtualenvs\sam-deepspeed-HJjkcE65-py3.11\Lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
[rank1]:     return F.linear(input, self.weight, self.bias)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-09-27 12:46:54,403] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 39592
[2024-09-27 12:46:54,407] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 47220
[2024-09-27 12:46:54,407] [ERROR] [launch.py:325:sigkill_handler] ['C:\\Users\\SamDev\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\sam-deepspeed-HJjkcE65-py3.11\\Scripts\\python.exe', '-u', 'gpt-neo-2.7b-generation.py', '--local_rank=1'] exits with return code = 1
loadams commented 1 month ago

Thanks @getao and @SamMicheals for confirming; we will need to take a closer look at this.

mzamini92 commented 2 weeks ago

Same issue +1.

OS: Linux
GPU count and types: 8 x H100 80GB
Hugging Face Transformers/Accelerate/etc. versions: transformers@4.40 & 4.42
Python version: 3.10
deepspeed: 0.15.2 & 0.14.2 & 0.14.4
PyTorch: 2.4.0

By modifying python3.10/site-packages/deepspeed/runtime/engine.py, commenting out the FusedAdam construction and substituting torch.optim.Adam:

                    optimizer = torch.optim.Adam(model_parameters, **optimizer_parameters)
                    # from deepspeed.ops.adam import FusedAdam

                    # optimizer = FusedAdam(
                    #     model_parameters,
                    #     **optimizer_parameters,
                    #     adam_w_mode=effective_adam_w_mode,
                    # )

I was able to bypass the error, but I'm not sure what the effect of using Adam instead of FusedAdam would be. (A less invasive variant of this workaround is sketched below.)
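For what it's worth, the same effect can likely be had without patching site-packages: DeepSpeed's optimizer config accepts a torch_adam flag that selects the pure-PyTorch optimizer. A sketch with placeholder hyperparameters (this flag is my suggestion, not something confirmed in this thread):

```python
# Sketch: ask DeepSpeed for torch.optim.Adam via the config instead of
# editing engine.py. "torch_adam": True skips FusedAdam's fused CUDA kernel.
# All numbers are placeholder hyperparameters.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4, "torch_adam": True},
    },
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

As for the effect: FusedAdam computes the same Adam/AdamW update, just fused into a single multi-tensor CUDA kernel for speed, so with matching weight-decay settings the results should be equivalent, only somewhat slower.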

bwnotfound commented 2 weeks ago

Same issue +1.

OS: Ubuntu 20.04
GPU count and types: 8 x H100 80GB
CUDA Version: 12.2
PyTorch: 2.3.1+cu118
Python: 3.10.15
deepspeed: 0.15.3 & 0.14.5

[rank0]:   File "/nfs/volume-764-2/lirui/envs/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1816, in _optimizer_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/nfs/volume-764-2/lirui/envs/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/nfs/volume-764-2/lirui/envs/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
[rank0]:     multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
[rank0]:   File "/nfs/volume-764-2/lirui/envs/lib/python3.10/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
[rank0]:     return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Only H800 GPUs reproduce this bug; A100 GPUs do not. When I run with DeepSpeedCPUAdam instead, the H100 GPUs show no error. BTW, CUDA 12.2 seems to have no ill effect when training on A100 GPUs. (One way to end up on DeepSpeedCPUAdam is sketched below.)
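For reference, a sketch of how one typically ends up on DeepSpeedCPUAdam: enabling ZeRO optimizer offload moves the Adam step to the CPU, which avoids the fused GPU kernel entirely. The values below are placeholders, not from this thread:

```python
# Sketch: ZeRO-2 with optimizer offload. DeepSpeed then runs the optimizer
# as DeepSpeedCPUAdam on the host, so the failing fused CUDA kernel is never
# launched. Placeholder hyperparameters throughout.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
```

The tradeoff is extra host-device traffic per step, so this is a workaround rather than a fix for the missing kernel image.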

loadams commented 2 weeks ago

Hi @getao, @bwnotfound, @mzamini92 - could you please test with the PR linked here: #6669?