microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] RuntimeError: CUDA error: unknown error #3403

Closed SH0AN closed 1 year ago

SH0AN commented 1 year ago

Describe the bug
Running the training script fails with the CUDA error shown in the log output.

Log output

Evaluating perplexity, Epoch 0/1
Traceback (most recent call last):
  File "main.py", line 345, in <module>
    main()
  File "main.py", line 306, in main
    perplexity = evaluation(model, eval_dataloader)
  File "main.py", line 257, in evaluation
    outputs = model(**batch)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1675, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 950, in forward
    logits = self.lm_head(outputs[0]).contiguous()
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2023-04-28 09:46:00,441] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2469
[2023-04-28 09:46:00,442] [ERROR] [launch.py:434:sigkill_handler] ['/home/sh0an/anaconda3/envs/Chat/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1

To Reproduce
Execute the script:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu

System info
PyTorch: 2.0
CUDA: 11.8
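
The error message in the log suggests passing CUDA_LAUNCH_BLOCKING=1, which makes CUDA kernel launches synchronous so the traceback points closer to the call that actually failed. A minimal sketch of rerunning a command that way from Python follows; the helper names `blocking_env` and `rerun` are hypothetical and only illustrate the idea, and the same effect can be had by setting the variable in the shell before launching.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING is the variable named in the PyTorch error message.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(cmd):
    """Run cmd (a list of arguments) with CUDA_LAUNCH_BLOCKING=1; return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example call, using the command from the "To Reproduce" step (not executed here):
# rerun(["python", "train.py", "--actor-model", "facebook/opt-1.3b",
#        "--reward-model", "facebook/opt-350m", "--deployment-type", "single_gpu"])
```

Synchronous launches slow training down, so this is meant for a single debugging run rather than normal use.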

zy-sunshine commented 1 year ago

The 12 GB of memory on an RTX 4070 Ti is not enough to train the DeepSpeed-Chat 1.3B model. I have also hit this error before; in my case the cause was that my RTX 3090 got too hot and the GPU stopped working at that time.

molly-smith commented 1 year ago

Hi @SH0AN, as @zy-sunshine mentioned, one 12G GPU is not enough for this task.
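
Both causes mentioned in this thread (too little GPU memory, an overheating GPU) can be checked while the job runs. The following is a rough sketch that reads name, temperature, and memory use through `nvidia-smi`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the sketch assumes `nvidia-smi` is on the PATH and reports numeric values for every field.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "name,temperature.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot shift fields.
        name, temp, used, total = [f.strip() for f in line.rsplit(",", 3)]
        gpus.append({
            "name": name,
            "temp_c": int(temp),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return gpus


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


for gpu in query_gpus():
    free = gpu["total_mib"] - gpu["used_mib"]
    print(f'{gpu["name"]}: {gpu["temp_c"]} C, {free} MiB free of {gpu["total_mib"]} MiB')
```

Running this in a second terminal during training shows whether memory use climbs toward the card's limit or the temperature rises before the crash.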