hpcaitech / ColossalAI


[BUG]: Chat with GPTRM #3482

Open lljjgg opened 1 year ago

lljjgg commented 1 year ago

🐛 Describe the bug

When I train the RM model it trains normally. The script is as follows:

```shell
torchrun --standalone --nproc_per_node=1 \
    ./examples/train_reward_model.py \
    --dataset './datasets/reward_model/train.json' \
    --vocab_file 'vocab.txt' \
    --pretrain './pretrained_models/gpt1.3B' \
    --model 'gpt2' \
    --batch_size 1 \
    --seq_len 2048 \
    --max_epochs 1 \
    --strategy colossalai_zero2
```

But when I modify the code to

```python
model = GPTRM(pretrained=args.pretrain, checkpoint=True).cuda()
```

to enable gradient checkpointing, the following error occurs:

```
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    train(args)
  File "./examples/train_reward_model.py", line 98, in train
    trainer.fit()
  File "/data/nfs/luojiangang/ColossalAI-main_c/applications/ChatGPT/chatgpt/trainer/rm.py", line 68, in fit
    self.strategy.backward(loss, self.model, self.optimizer)
  File "/data/nfs/luojiangang/ColossalAI-main_c/applications/ChatGPT/chatgpt/trainer/strategies/colossalai.py", line 133, in backward
    optimizer.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py", line 240, in backward
    self.module.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/parallel/data_parallel.py", line 323, in backward
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 479, in backward
    return handle_torch_function(
  File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1534, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/colo_tensor.py", line 181, in __torch_function__
    return backward_tensor.backward(**tensor_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/parallel/data_parallel.py", line 348, in grad_handle
    self.overflow_counter += chunk.has_inf_or_nan
  File "/opt/conda/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py", line 237, in has_inf_or_nan
    return torch.isinf(valid_tensor).any().item() | torch.isnan(valid_tensor).any().item()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Train epoch:   0%| | 0/1 [00:05<?, ?it/s]
Train step of epoch 0:   0%| | 0/20 [00:05<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 38899) of binary: /opt/conda/bin/python
```

Changing the strategy to ddp avoids the problem. Do you know the reason? Looking forward to your reply.
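For reference, below is a minimal sketch of what that flag amounts to, assuming `GPTRM` simply forwards `checkpoint=True` to Hugging Face's `gradient_checkpointing_enable()` on its GPT-2 backbone (the tiny config is illustrative, not the 1.3B checkpoint from the command above):

```python
# Minimal sketch, not the repository code: assumes GPTRM's checkpoint=True flag
# maps to Hugging Face gradient checkpointing on the wrapped GPT2Model.
import torch
from transformers import GPT2Config, GPT2Model

device = "cuda" if torch.cuda.is_available() else "cpu"

model = GPT2Model(GPT2Config(n_layer=2, n_head=2, n_embd=64)).to(device)
model.gradient_checkpointing_enable()  # what checkpoint=True is expected to switch on

input_ids = torch.randint(0, model.config.vocab_size, (1, 16), device=device)
hidden = model(input_ids=input_ids).last_hidden_state

# During backward, every checkpointed block is recomputed. Under colossalai_zero2
# the recomputed gradients then pass through Gemini's chunk-based gradient hooks
# (grad_handle -> chunk.has_inf_or_nan in the traceback above), which is where the
# illegal memory access surfaces.
hidden.sum().backward()
```

The traceback bottoms out in `data_parallel.py:grad_handle` and `chunk.has_inf_or_nan`, which only run under the ColossalAI/Gemini strategies, so it is at least consistent with the observation that plain ddp does not crash.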

Environment

No response


JThh commented 1 year ago

Is that the only line you changed?

```python
model = GPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank, checkpoint=True).to(torch.cuda.current_device())
```
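
For comparison, a rough sketch of how that construction would sit in the example script; the strategy setup and the `model_init_context()` wrapper are assumptions based on the other ColossalAI ChatGPT examples, not copied from this exact revision of `train_reward_model.py`:

```python
# Rough sketch; the strategy construction and context manager below are assumptions
# based on the ColossalAI ChatGPT examples, not the exact file under discussion.
import argparse
import torch
from chatgpt.models.gpt import GPTRM
from chatgpt.trainer.strategies import ColossalAIStrategy

parser = argparse.ArgumentParser()
parser.add_argument('--pretrain', default='./pretrained_models/gpt1.3B')
parser.add_argument('--lora_rank', type=int, default=0)
args = parser.parse_args()

strategy = ColossalAIStrategy(stage=2)  # assumed mapping for --strategy colossalai_zero2

with strategy.model_init_context():
    # Suggested form: keep lora_rank and place the model on the rank-local device
    # explicitly instead of calling .cuda() on the default device.
    model = GPTRM(pretrained=args.pretrain,
                  lora_rank=args.lora_rank,
                  checkpoint=True).to(torch.cuda.current_device())
```

The main difference from the reporter's line is that `lora_rank` is still passed through and the model is moved to the rank-local device via `torch.cuda.current_device()` rather than `.cuda()`.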

lljjgg commented 1 year ago

Yes, only that one line has been modified.