microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Step 3: RuntimeError: CUDA error: misaligned address #385

Open EikeKohl opened 1 year ago

EikeKohl commented 1 year ago

I am trying to run RLHF (step 3) with my previously trained actor and reward models. However, I encounter the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
    main()
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
    out = trainer.generate_experience(prompts)
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
    seq = self._generate_sequence(prompts)
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 238, in generate
    with GatheredParameters(non_active_layers):
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1596, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 913, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1050, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1060, in _partition_param
    assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor.py", line 873, in __format__
    return object.__format__(self, format_spec)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/nn/parameter.py", line 60, in __repr__
    return 'Parameter containing:\n' + super().__repr__()
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 309, in _tensor_str
    self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I then set TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1 for easier debugging. This results in the following, more detailed error message:

******************[end] Initialized Reward Model [end] (duration: 2.13s)******************
***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 3692
------------------------------------------------------
Free memory : 9.476501 (GigaBytes)  
Total memory: 14.620972 (GigaBytes)  
Requested memory: 0.073242 (GigaBytes) 
Setting maximum total tokens (input + output) to 1024 
WorkSpace: 0x7fea90000000 
------------------------------------------------------
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13) 
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13) 
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13) 
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13) 
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
    main()
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
    out = trainer.generate_experience(prompts)
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
    seq = self._generate_sequence(prompts)
  File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
    generate_ret_vals = self._generate(*inputs, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/generation/utils.py", line 1508, in generate
    return self.greedy_search(
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/generation/utils.py", line 2325, in greedy_search
    outputs = self(
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1097, in forward
    lm_logits = self.lm_head(hidden_states)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/deepspeed/module_inject/layers.py", line 50, in forward
    output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1670525539683/work/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feb8784f457 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7feb878193ec in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7febb956c044 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x164bc (0x7febb95434bc in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7febb9546434 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4cf653 (0x7febcf6f9653 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7feb8782f9e0 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7feb8782faf9 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x72d9c8 (0x7febcf9579c8 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7febcf957cb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #11: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #12: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #13: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #14: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #15: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #16: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #17: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #18: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #19: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #20: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #21: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #22: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #23: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #24: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #25: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #26: <unknown function> + 0x1348e8 (0x564c098b88e8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #27: <unknown function> + 0x14860e (0x564c098cc60e in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #28: <unknown function> + 0x1485fb (0x564c098cc5fb in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #29: <unknown function> + 0x11c661 (0x564c098a0661 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #30: PyDict_SetItemString + 0x4a (0x564c098a66aa in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #31: <unknown function> + 0x21470c (0x564c0999870c in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #32: Py_FinalizeEx + 0x186 (0x564c09997856 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #33: Py_RunMain + 0x112 (0x564c0998afe2 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #34: Py_BytesMain + 0x39 (0x564c0995d979 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #35: __libc_start_main + 0xea (0x7fec0ca6113a in /lib64/libc.so.6)
frame #36: <unknown function> + 0x1d9881 (0x564c0995d881 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)

This is my bash script:

export OUTPUT_PATH=./output
mkdir -p $OUTPUT_PATH

ACTOR_ZERO_STAGE="--actor_zero_stage 2"
CRITIC_ZERO_STAGE="--critic_zero_stage 2"
ACTOR_MODEL_PATH="../step1_supervised_finetuning/output" # Provide the ckpt path of the actor model
CRITIC_MODEL_PATH="../step2_reward_model_finetuning/output" # Provide the ckpt path of the critic model

Actor_Lr=5e-4
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path my_dataset.json \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 0 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 512 \
   --max_prompt_seq_len 512 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   ${ACTOR_ZERO_STAGE} \
   ${CRITIC_ZERO_STAGE} \
   --actor_lora_dim 128 \
   --enable_hybrid_engine \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT_PATH \
    &> $OUTPUT_PATH/training.log

XiaoLaoDi commented 1 year ago

@EikeKohl Have you solved this problem? I am getting the same error.

EikeKohl commented 1 year ago

Hey @XiaoLaoDi not yet, but here is what I tried so far:

As for the CUDA setup, I tried

However, the error remains...

EikeKohl commented 1 year ago

@XiaoLaoDi What does your setup look like? Maybe we can identify similarities and possible problem areas.

ruihan0495 commented 1 year ago

Maybe see this issue: https://github.com/microsoft/DeepSpeedExamples/issues/335#issuecomment-1521105300

EikeKohl commented 1 year ago

@ruihan0495 Thank you for the info. Not using DeepSpeed-HE (the hybrid engine) does indeed make training possible 🙂. I run into another exception a little later in the code, but that is probably due to poor model quality. I am currently working on fixing that issue as well.

DehongXu commented 1 year ago

@EikeKohl I got the same error when training GPT2, and I also used an AWS EC2 instance. Did you figure out what the problem is?

EikeKohl commented 1 year ago

@DehongXu tbh I haven't used DeepSpeed RLHF in a while, but I remember that there was a known issue with the hybrid engine that was supposed to be fixed in upcoming updates. Disabling the hybrid engine made it work for me at that time.
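
For anyone landing here: the workaround described in this thread amounts to launching step 3 without `--enable_hybrid_engine`. Below is a minimal sketch, assuming the same arguments as the script in the original post. `USE_HYBRID_ENGINE`, `DRY_RUN`, `HYBRID_ENGINE_FLAG` and `LAUNCH_CMD` are hypothetical names introduced for this example; they are not DeepSpeed-Chat options. By default it only prints the command, so you can check the flags before starting a run.

```shell
#!/bin/sh
# Sketch only: the launch command from the original post with the hybrid
# engine made optional. All variable names below are hypothetical helpers.
USE_HYBRID_ENGINE="${USE_HYBRID_ENGINE:-0}"   # 0 = omit --enable_hybrid_engine
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print the command, do not run it

OUTPUT_PATH=./output
ACTOR_MODEL_PATH="../step1_supervised_finetuning/output"
CRITIC_MODEL_PATH="../step2_reward_model_finetuning/output"

# Stays empty unless the hybrid engine is explicitly requested.
HYBRID_ENGINE_FLAG=""
if [ "$USE_HYBRID_ENGINE" = "1" ]; then
   HYBRID_ENGINE_FLAG="--enable_hybrid_engine"
fi

# The command is kept in a plain string and expanded unquoted further down.
# That is only safe here because no argument contains whitespace.
LAUNCH_CMD="deepspeed --master_port 12346 main.py \
   --data_path my_dataset.json \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 0 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 512 \
   --max_prompt_seq_len 512 \
   --actor_learning_rate 5e-4 \
   --critic_learning_rate 5e-6 \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --actor_zero_stage 2 \
   --critic_zero_stage 2 \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   $HYBRID_ENGINE_FLAG \
   --output_dir $OUTPUT_PATH"

if [ "$DRY_RUN" = "1" ]; then
   # Unquoted on purpose: collapses the continuation whitespace for display.
   echo $LAUNCH_CMD
else
   mkdir -p "$OUTPUT_PATH"
   # Unquoted on purpose: the string is split into the individual arguments.
   $LAUNCH_CMD > "$OUTPUT_PATH/training.log" 2>&1
fi
```

Running it with `DRY_RUN=0` launches the training; adding `USE_HYBRID_ENGINE=1` puts the flag back, which is handy for checking whether a newer DeepSpeed release still reproduces the error.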