shuxueslpi / chatGLM-6B-QLoRA

Uses the peft library to do efficient 4-bit QLoRA fine-tuning of chatGLM-6B/chatGLM2-6B, then merges the LoRA model into the base model and quantizes the result to 4 bits.
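For orientation, here is a minimal sketch of what a peft + bitsandbytes 4-bit QLoRA setup of this kind typically looks like. It is illustrative only: the hyperparameter values and `target_modules` below are assumptions for chatGLM2-6B, not necessarily what this repo's `train_qlora.py` does.

```python
# Minimal 4-bit QLoRA setup sketch (illustrative; values are assumptions,
# not the repo's exact configuration).
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization of the base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,   # matches --compute_dtype fp16
)

model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", trust_remote_code=True, quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)   # cast norms, enable input grads, etc.

lora_config = LoraConfig(
    r=4,                                    # matches --lora_rank 4
    lora_alpha=32,
    lora_dropout=0.05,                      # matches --lora_dropout 0.05
    target_modules=["query_key_value"],     # assumed LoRA target for chatGLM2 attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # prints "trainable params: ... || all params: ..."
```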

LoRA fine-tuning of chatglm2 fails with "CUDA error: invalid argument", could someone please take a look? #19

Open LKk8563 opened 1 year ago

LKk8563 commented 1 year ago

When doing LoRA fine-tuning of chatglm2, I get "CUDA error: invalid argument". Environment: Windows, Python 3.10, CUDA 11.8.

```
PS E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main> python train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path THUDM/chatglm2-6b --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp16
```

```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues

bin C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
CUDA SETUP: CUDA runtime path found: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll...
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 100%|████████████████| 7/7 [00:06<00:00, 1.11it/s]
trainable params: 974,848 || all params: 3,389,286,400 || trainable%: 0.0287626327477076
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-d642ff6439cea90e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████| 1/1 [00:00<00:00, 243.11it/s]
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-bf648ec70cbcb4a4/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████| 1/1 [00:00<?, ?it/s]
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
  0%|                  | 0/3581 [00:00<?, ?it/s]
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Traceback (most recent call last):
  File "E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py", line 206, in <module>
    train(args)
  File "E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py", line 200, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py", line 1645, in train
    return inner_training_loop(
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py", line 2770, in training_step
    self.accelerator.backward(loss)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py", line 1821, in backward
    loss.backward(**kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\function.py", line 274, in apply
    return user_fn(self, *args)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py:206 in <module>

    203
    204 if __name__ == "__main__":
    205     args = parse_args()
  ❱ 206     train(args)
    207
    208

E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py:200 in train

    197         data_collator=data_collator
    198     )
    199
  ❱ 200     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    201     trainer.model.save_pretrained(hf_train_args.output_dir)
    202
    203

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py:1645 in train

    1642         inner_training_loop = find_executable_batch_size(
    1643             self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
    1644         )
  ❱ 1645         return inner_training_loop(
    1646             args=args,
    1647             resume_from_checkpoint=resume_from_checkpoint,
    1648             trial=trial,

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py:1938 in _inner_training_loop

    1935                     self.control = self.callback_handler.on_step_begin(args, self.state,
    1936
    1937                 with self.accelerator.accumulate(model):
  ❱ 1938                     tr_loss_step = self.training_step(model, inputs)
    1939
    1940                 if (
    1941                     args.logging_nan_inf_filter

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py:2770 in training_step

    2767             with amp.scale_loss(loss, self.optimizer) as scaled_loss:
    2768                 scaled_loss.backward()
    2769         else:
  ❱ 2770             self.accelerator.backward(loss)
    2771
    2772         return loss.detach() / self.args.gradient_accumulation_steps
    2773

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:1821 in backward

    1818         elif self.scaler is not None:
    1819             self.scaler.scale(loss).backward(**kwargs)
    1820         else:
  ❱ 1821             loss.backward(**kwargs)
    1822
    1823     def unscale_gradients(self, optimizer=None):
    1824         """

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_tensor.py:487 in backward

    484                 create_graph=create_graph,
    485                 inputs=inputs,
    486             )
  ❱ 487         torch.autograd.backward(
    488             self, gradient, retain_graph, create_graph, inputs=inputs
    489         )
    490

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py:200 in backward

    197     # The reason we repeat same the comment below is that
    198     # some Python versions print out the first line of a multi-line function
    199     # calls in the traceback and some print out the last line
  ❱ 200     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    201         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    202         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\function.py:274 in apply

    271                             "Function is not allowed. You should only implement one "
    272                             "of them.")
    273         user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
  ❱ 274         return user_fn(self, *args)
    275
    276     def apply_jvp(self, *args):
    277         # _forward_cls is defined by derived class

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\checkpoint.py:157 in backward

    154             raise RuntimeError(
    155                 "none of output has requires_grad=True,"
    156                 " this checkpoint() is not necessary")
  ❱ 157         torch.autograd.backward(outputs_with_grad, args_with_grad)
    158         grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None
    159                       for inp in detached_inputs)
    160

C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py:200 in backward

    197     # The reason we repeat same the comment below is that
    198     # some Python versions print out the first line of a multi-line function
    199     # calls in the traceback and some print out the last line
  ❱ 200     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    201         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    202         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
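Not a fix, but the output above already suggests two diagnostics: the bitsandbytes banner asks you to run `python -m bitsandbytes`, and the CUDA error recommends rerunning with `CUDA_LAUNCH_BLOCKING=1` so the failing kernel is reported at the correct frame instead of the generic autograd entry point. A small wrapper (a hypothetical `debug_rerun.py`, not part of this repo) could run both with the reporter's exact arguments:

```python
# debug_rerun.py -- hypothetical helper: reruns the failing training command
# with synchronous CUDA launches, plus the bitsandbytes self-check requested
# in the BUG REPORT banner. This only localises the error; it is not a fix.
import os
import subprocess
import sys

env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")  # synchronous kernel launches

# 1) bitsandbytes environment self-check
subprocess.run([sys.executable, "-m", "bitsandbytes"], env=env, check=False)

# 2) rerun the same train_qlora.py invocation under CUDA_LAUNCH_BLOCKING=1
subprocess.run(
    [
        sys.executable, "train_qlora.py",
        "--train_args_json", "chatGLM_6B_QLoRA.json",
        "--model_name_or_path", "THUDM/chatglm2-6b",
        "--train_data_path", "data/train.jsonl",
        "--eval_data_path", "data/dev.jsonl",
        "--lora_rank", "4",
        "--lora_dropout", "0.05",
        "--compute_dtype", "fp16",
    ],
    env=env,
    check=False,
)
```

With blocking launches the `RuntimeError` should point at the actual failing operation, which makes it easier to tell whether the problem is in the bitsandbytes 4-bit kernels, in gradient checkpointing, or elsewhere.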

shuxueslpi commented 1 year ago

@LKk8563 Have you solved this yet? What is your hardware setup, and which GPU model are you using?
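For anyone answering that, a quick way to collect the requested details (GPU model plus the relevant library versions) is a short snippet like the hypothetical one below:

```python
# Hypothetical environment report for this issue: GPU model and library versions.
import torch
import transformers, accelerate, peft, bitsandbytes

print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device")
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("peft:", peft.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
```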

Derican commented 1 year ago

Same question here; I run into this problem as well. In my docker environment, chatGLM2-6B's cli_demo runs fine.