SH0AN closed this issue 1 year ago
An RTX 4070 Ti (12 GB) does not have enough memory to train the ds-chat 1.3b model. I have also hit this error before: my RTX 3090 was running too hot, and the GPU stopped working at that point.
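To tell the two causes apart (memory versus overheating), it can help to log the GPU temperature and memory use while the job runs. The following is only a sketch: it assumes `nvidia-smi` is on the PATH and that every queried field comes back as a plain integer; `parse_gpu_stats` and `query_gpu_stats` are hypothetical helper names, not part of DeepSpeed.

```python
import subprocess


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        # Each line looks like "65, 1024, 12288" (deg C, MiB used, MiB total).
        temp_c, used_mib, total_mib = (int(field) for field in line.split(","))
        stats.append({"temp_c": temp_c, "used_mib": used_mib, "total_mib": total_mib})
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (assumes the tool is installed)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=temperature.gpu,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `query_gpu_stats()` in a second terminal every few seconds during evaluation should show whether memory is close to the card's limit or the temperature is climbing just before the crash.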
Hi @SH0AN, as @zy-sunshine mentioned, a single 12 GB GPU is not enough for this task.
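A rough back-of-the-envelope estimate shows why 12 GB is tight. This sketch assumes all of the roughly 1.3 billion weights are trained with mixed-precision Adam, which is commonly budgeted at about 16 bytes per parameter (fp16 weights and gradients plus fp32 optimizer state). It ignores activations, and a run that only optimizes LoRA parameters would need less. The function name is hypothetical.

```python
def training_memory_gib(n_params, bytes_per_param=16):
    """Estimate model-state memory in GiB, ignoring activations and buffers."""
    return n_params * bytes_per_param / 2**30


# Nominal parameter count for facebook/opt-1.3b (assumed to be 1.3e9).
estimate = training_memory_gib(1.3e9)
print(f"~{estimate:.1f} GiB of model state before activations")
```

Under these assumptions the model state alone exceeds what a 12 GB card can hold, before counting activations or the reward model.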
Describe the bug
Running the script fails with a CUDA error.
Log output
Evaluating perplexity, Epoch 0/1
Traceback (most recent call last):
  File "main.py", line 345, in <module>
    main()
  File "main.py", line 306, in main
    perplexity = evaluation(model, eval_dataloader)
  File "main.py", line 257, in evaluation
    outputs = model(**batch)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1675, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 950, in forward
    logits = self.lm_head(outputs[0]).contiguous()
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2023-04-28 09:46:00,441] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2469
[2023-04-28 09:46:00,442] [ERROR] [launch.py:434:sigkill_handler] ['/home/sh0an/anaconda3/envs/Chat/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
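The log suggests rerunning with `CUDA_LAUNCH_BLOCKING=1` so that the traceback points at the kernel that actually failed. One way to do that from Python is to pass a modified environment to the launcher. This is a minimal sketch: `blocking_env` is a hypothetical helper, and it assumes the launcher's child processes inherit the environment they are started with.

```python
import os


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, CUDA errors are reported at the call that caused them.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Example use (not executed here), reusing the command from this report:
# import subprocess
# subprocess.run(
#     ["python", "train.py", "--actor-model", "facebook/opt-1.3b",
#      "--reward-model", "facebook/opt-350m", "--deployment-type", "single_gpu"],
#     env=blocking_env(),
# )
```

Synchronous launches slow training down, so this setting is only meant for the debugging run.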
To Reproduce
Execute the script:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
System info (please complete the following information):
PyTorch: 2.0
CUDA: 11.8