Open xiaoweiweixiao opened 1 year ago
The error message is shown below. Is something misconfigured on my end, or is there another cause?
```
Traceback (most recent call last):
  File "finetune.py", line 170, in <module>
    main()
  File "finetune.py", line 161, in main
    trainer.train()
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 2645, in training_step
    loss = self.compute_loss(model, inputs)
  File "finetune.py", line 103, in compute_loss
    return model(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home2/la/chatgml-tuning/modeling_chatglm.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home2/la/chatgml-tuning/modeling_chatglm.py", line 860, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004496574401855469 seconds
  0%|          | 0/10000 [00:00<?, ?it/s]
/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92653 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 92654) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-31_14:02:55
  host      : guest-server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 92654)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
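For context, the RuntimeError at the bottom of the first trace is the generic cross-device failure from F.embedding: the word_embeddings weight ended up on one GPU and input_ids on the other. A minimal, purely illustrative reproduction of that failure mode (not code from the repo; it needs two visible GPUs):

```python
# Illustrative only (not code from ChatGLM-Tuning): F.embedding raises the same
# "Expected all tensors to be on the same device" error whenever the embedding
# weight and the index tensor live on different GPUs. Requires >= 2 GPUs.
import torch
import torch.nn.functional as F

weight = torch.randn(10, 4, device="cuda:0")    # embedding table on cuda:0
ids = torch.tensor([1, 2, 3], device="cuda:1")  # input_ids on cuda:1
F.embedding(ids, weight)                        # RuntimeError: Expected all tensors to be on the same device...
```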
If you have multiple GPUs, just launch it with deepspeed directly; you don't need to change any code.
It looks like the number of GPUs doesn't match num_process.
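If you want to check that, a small diagnostic printed at startup makes any GPU-count / process-count mismatch visible (a hypothetical helper, not part of ChatGLM-Tuning's finetune.py): under torchrun, WORLD_SIZE should equal --nproc_per_node, and every rank should report the same torch.cuda.device_count().

```python
# Hypothetical diagnostic, not part of the repo: print what each torchrun worker
# actually sees so a GPU-count / process-count mismatch shows up immediately.
import os
import torch

rank = os.environ.get("RANK", "?")               # global rank, set by torchrun
world_size = os.environ.get("WORLD_SIZE", "?")   # should equal --nproc_per_node on a single node
local_rank = os.environ.get("LOCAL_RANK", "?")   # this worker's GPU index after CUDA_VISIBLE_DEVICES remapping

print(f"rank={rank} world_size={world_size} local_rank={local_rank} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
      f"device_count={torch.cuda.device_count()}")
```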
> If you have multiple GPUs, just launch it with deepspeed directly; you don't need to change any code.
I ran it with exactly the deepspeed method you described for chatGLM-tuning. Replacing the modeling_chatglm.py in the chatGLM-tuning repo with the one from this repo should be all that's needed, right? Or is there something else I haven't set up correctly?
> It looks like the number of GPUs doesn't match num_process.
This is my fine-tuning command. I set the number of GPUs to 2; which parameter does num_process refer to?

```
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=2 finetune.py --dataset_path data_zh2/zh-data02 --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 10000 --save_steps 1000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --remove_unused_columns false --logging_steps 50 --output_dir output_zh-data02 --deepspeed ds_config_zero3.json
```
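For what it's worth, the GPU count and process count in that command do appear to match: CUDA_VISIBLE_DEVICES=2,3 exposes two GPUs and --nproc_per_node=2 (presumably the "num_process" meant above) starts two workers. The accelerate/hooks.py frames in the traceback suggest the model itself was loaded with device_map="auto" and split across both cards, which conflicts with one-model-per-worker training under torchrun/DeepSpeed. Below is a hedged sketch of keeping the whole model on each worker's own GPU instead; the actual loading code in finetune.py may differ, and under ZeRO-3 you would normally drop device_map entirely and let DeepSpeed place the parameters.

```python
# Sketch only, assuming finetune.py currently passes device_map="auto" to
# from_pretrained. Pinning the whole model to this worker's GPU avoids the
# cuda:0 / cuda:1 split shown in the traceback.
import os
import torch
from transformers import AutoModel

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
torch.cuda.set_device(local_rank)

model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",           # assumed model id; use whatever finetune.py actually loads
    trust_remote_code=True,
    device_map={"": local_rank},  # place every module on this worker's GPU
)
```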