liangwq / Chatglm_lora_multi-gpu

ChatGLM multi-GPU training with DeepSpeed and
404 stars · 61 forks

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select) #7

Open xiaoweiweixiao opened 1 year ago

xiaoweiweixiao commented 1 year ago

The error message is below. Is something misconfigured on my end, or is there another cause?

`Traceback (most recent call last):
  File "finetune.py", line 170, in <module>
    main()
  File "finetune.py", line 161, in main
    trainer.train()
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 2645, in training_step
    loss = self.compute_loss(model, inputs)
  File "finetune.py", line 103, in compute_loss
    return model(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home2/la/chatgml-tuning/modeling_chatglm.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home2/la/chatgml-tuning/modeling_chatglm.py", line 860, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004496574401855469 seconds
  0%|                                                                                                                                                                               | 0/10000 [00:00<?, ?it/s]/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92653 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 92654) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-31_14:02:55
  host      : guest-server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 92654)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html`
liangwq commented 1 year ago

> The error message is below. Is something misconfigured on my end, or is there another cause?


You have multiple GPUs, so just run it with deepspeed directly; no code changes are needed.
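
For reference, a minimal sketch of what "run it with deepspeed directly" could look like, reusing ds_config_zero3.json and the finetune.py arguments from the command quoted later in this thread; `--include localhost:2,3` is the DeepSpeed launcher's way of selecting the same two physical GPUs that `CUDA_VISIBLE_DEVICES=2,3` does:

```bash
# Sketch: launch finetune.py with the DeepSpeed launcher instead of torchrun.
# --include localhost:2,3 pins the job to physical GPUs 2 and 3 on this machine;
# the launcher spawns one process per selected GPU, so no --nproc_per_node is needed.
deepspeed --include localhost:2,3 finetune.py \
    --dataset_path data_zh2/zh-data02 \
    --lora_rank 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --max_steps 10000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output_zh-data02 \
    --deepspeed ds_config_zero3.json
```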

llplay commented 1 year ago

It looks like the number of GPUs doesn't match num_process.
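
A few quick single-node checks (a sketch, not from the thread) to compare the GPU count the job can actually see against the process count being launched:

```bash
# Sketch: sanity-check that the visible GPU count matches --nproc_per_node.
nvidia-smi -L                                                 # list all physical GPUs on the host
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"  # which GPUs the job is allowed to use
python -c "import torch; print(torch.cuda.device_count())"    # GPUs PyTorch actually sees
```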

xiaoweiweixiao commented 1 year ago

> You have multiple GPUs, so just run it with deepspeed directly; no code changes are needed.

I did run it with the deepspeed method you described for chatGLM-tuning. Is it enough to replace the modeling_chatglm.py in the chatGLM-tuning repo with the one from your repo, or is there something else I haven't configured properly?

xiaoweiweixiao commented 1 year ago

> It looks like the number of GPUs doesn't match num_process.

This is my fine-tuning command; I set the number of GPUs to 2. Which parameter does num_process refer to?

`CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=2 finetune.py --dataset_path data_zh2/zh-data02 --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 10000 --save_steps 1000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --remove_unused_columns false --logging_steps 50 --output_dir output_zh-data02 --deepspeed ds_config_zero3.json`
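
For torchrun, the process count is `--nproc_per_node`, and `CUDA_VISIBLE_DEVICES=2,3` remaps the two selected cards to cuda:0 and cuda:1 inside the job, so the command above already launches one process per visible GPU. A hedged variant (assuming a recent PyTorch where `--nproc_per_node` accepts the value `gpu`) that keeps the two numbers in sync automatically:

```bash
# Sketch: let torchrun derive the process count from the visible GPUs instead of
# hard-coding 2, so it can never drift out of step with CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=gpu finetune.py \
    --dataset_path data_zh2/zh-data02 --lora_rank 8 \
    --per_device_train_batch_size 1 --gradient_accumulation_steps 1 \
    --max_steps 10000 --save_steps 1000 --save_total_limit 2 \
    --learning_rate 2e-5 --fp16 --remove_unused_columns false \
    --logging_steps 50 --output_dir output_zh-data02 \
    --deepspeed ds_config_zero3.json
```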