what(): CUDA error: an illegal memory access was encountered

qinzhiliang commented 1 year ago

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: an illegal memory access was encountered Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b1fd134d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x64 (0x7f1b1fcdd36b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const, char const, int, bool) + 0x118 (0x7f1b1fdafb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #3: + 0x1c36b (0x7f1b1fd8036b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #4: + 0x2b930 (0x7f1b1fd8f930 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #5: + 0x4d56d6 (0x7f1b867306d6 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #6: + 0x3ee77 (0x7f1b1fcf8e77 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so) frame #7: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const, c10::TensorImpl, bool) + 0x41 (0x7f1b1fcf3391 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so) frame #8: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const, c10::TensorImpl, c10::VariableVersion const&, bool) + 0x14 (0x7f1b1fcf3404 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)

step1: deepspeed main.py \ --data_path bote/gpt_part_data \ --data_split 2,4,4 \ --model_name_or_path FreedomIntelligence/phoenix-inst-chat-7b \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --max_seq_len 512 \ --learning_rate 9.65e-6 \ --weight_decay 0. \ --num_train_epochs 16 \ --gradient_accumulation_steps 1 \ --lr_scheduler_type cosine \ --num_warmup_steps 0 \ --seed 1234 \ --gradient_checkpointing \ --zero_stage $ZERO_STAGE \ --deepspeed \ --output_dir $OUTPUT \ 2>&1 | tee $OUTPUT/training.log

step2: deepspeed main.py \ --data_path bote/whoareyou \ --data_split 2,4,4 \ --model_name_or_path bigscience/bloomz-560m \ --num_padding_at_beginning 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --max_seq_len 512 \ --learning_rate 5e-5 \ --weight_decay 0.1 \ --num_train_epochs 1 \ --disable_dropout \ --gradient_accumulation_steps 1 \ --lr_scheduler_type cosine \ --num_warmup_steps 0 \ --seed 1234 \ --zero_stage $ZERO_STAGE \ --deepspeed \ --output_dir $OUTPUT \ 2>&1 | tee $OUTPUT/training.log

step3: deepspeed --master_port 12346 main.py \ --data_path bote/gpt_part_data \ --data_split 2,4,4 \ --actor_model_name_or_path $ACTOR_MODEL_PATH \ --critic_model_name_or_path $CRITIC_MODEL_PATH \ --num_padding_at_beginning 1 \ --per_device_train_batch_size 1 \ --per_device_mini_train_batch_size 1 \ --generation_batch_numbers 1 \ --ppo_epochs 1 \ --max_answer_seq_len 256 \ --max_prompt_seq_len 256 \ --actor_learning_rate ${Actor_Lr} \ --critic_learning_rate ${Critic_Lr} \ --actor_weight_decay 0.1 \ --critic_weight_decay 0.1 \ --num_train_epochs 1 \ --lr_scheduler_type cosine \ --gradient_accumulation_steps 1 \ --actor_gradient_checkpointing \ --disable_actor_dropout \ --num_warmup_steps 100 \ --deepspeed --seed 1234 \ --enable_hybrid_engine \ --actor_zero_stage $ACTOR_ZERO_STAGE \ --critic_zero_stage $CRITIC_ZERO_STAGE \ --output_dir $OUTPUT \ 2>&1 | tee $OUTPUT/training.log

GPU： 8 * A40（48G）