Multi-GPU chatglm2 SFT fails with RuntimeError: expected scalar type Half but found Float

zhr0313 commented 1 year ago

[...]
  2670 │ labels = None
[...]
❱  848 │ transformer_outputs = self.transformer(
[...]
/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in _call_impl
  1498 │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
  1499 │         or _global_backward_pre_hooks or _global_backward_hooks
  1500 │         or _global_forward_hooks or _global_forward_pre_hooks):
❱ 1501 │     return forward_call(*args, **kwargs)
[...]
in new_forward
   168 │ module.forward = new_forward
[...]
/root/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py:114 in forward
   113 │ def forward(self, input: Tensor) -> Tensor:
❱  114 │     return F.linear(input, self.weight, self.bias)
   116 │ def extra_repr(self) -> str:
RuntimeError: expected scalar type Half but found Float

The error is raised at step 500 (save_steps = 500). Launch command:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 supervised_finetuning.py \

Single-GPU training works fine.

zhr0313 commented 1 year ago

After testing, the error appears to be raised during eval.

zhr0313 commented 1 year ago

Wrapping both the train and the eval code in with torch.autocast("cuda"): solves the problem.
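A minimal sketch of what this workaround does, not the actual MedicalGPT training code: a layer with low-precision weights rejects float32 input, while the same call succeeds inside torch.autocast. The toy nn.Linear layer, the tensor shapes, and the CPU/bfloat16 fallback (so the snippet also runs without a GPU) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU + bfloat16 when no GPU is available,
# so that the sketch still runs; the thread itself uses "cuda".
device = "cuda" if torch.cuda.is_available() else "cpu"
low_dtype = torch.float16 if device == "cuda" else torch.bfloat16

layer = nn.Linear(4, 2).to(device=device, dtype=low_dtype)  # low-precision weights
x = torch.randn(3, 4, device=device)                         # float32 activations

# Without autocast, input and weight dtypes disagree and the matmul is rejected.
mismatch_raised = False
try:
    layer(x)
except RuntimeError as err:
    mismatch_raised = True
    print("without autocast:", err)

# Inside autocast, the inputs of linear are cast to a common low-precision dtype.
with torch.no_grad(), torch.autocast(device_type=device, dtype=low_dtype):
    out = layer(x)

print("with autocast:", out.dtype)
```

The exact wording of the RuntimeError depends on the PyTorch version and device, so it may not match the message in the title character for character.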

shibing624 commented 1 year ago

OK.

daimazz1 commented 1 year ago

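Hi, I hit the same error when training chatglm-6b on a single GPU. After wrapping train and eval in the PT stage in with torch.autocast("cuda"): it now runs through. However, I then tested bloom and found that with this change the eval perplexity is above 20,000, which is clearly abnormal. Does adding this hurt the performance of the model trained in the PT stage?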

zhr0313 commented 1 year ago

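Yes, I noticed this too. With it added, the loss also stops going down. Reinstalling my environment solved the problem. Alternatively, setting eval_step very large so that eval never runs also avoids it, and I did not see any effect on the resulting model.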

daimazz1 commented 1 year ago

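> Yes, I noticed this too. With it added, the loss also stops going down. Reinstalling my environment solved the problem. Alternatively, setting eval_step very large so that eval never runs also avoids it, and I did not see any effect on the resulting model.

Hi, this shouldn't really be related to the environment, should it? Did you hit this with the chatglm model? And after adding with torch.autocast("cuda"): how large does eval_step have to be to avoid the problem?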

zhr0313 commented 1 year ago

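On a 4×V100 environment with the latest libraries it runs without problems. On a 2×A100 environment whose libraries are not the latest, the problem appears. For various reasons the libraries on the A100 environment are hard to change, so I cannot say for sure whether the libraries are the cause. Without with torch.autocast("cuda"): setting eval_step larger than your number of training steps avoids this error (expected scalar type Half but found Float).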
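A rough illustration of why an eval_step larger than the number of training steps skips evaluation, assuming that periodic evaluation fires whenever the current step is a multiple of eval_step. The helper periodic_eval_steps and the step counts are hypothetical and are not part of MedicalGPT or any library.

```python
def periodic_eval_steps(max_steps: int, eval_steps: int) -> list[int]:
    """Return the training steps at which a periodic evaluation would fire.

    Assumption: evaluation runs whenever step % eval_steps == 0.
    """
    return [step for step in range(1, max_steps + 1) if step % eval_steps == 0]


# eval_steps = 500 over 1500 training steps: evaluation runs three times.
print(periodic_eval_steps(max_steps=1500, eval_steps=500))  # [500, 1000, 1500]

# eval_steps larger than the total step count: evaluation never runs.
print(periodic_eval_steps(max_steps=1500, eval_steps=10**9))  # []
```

This only models periodic evaluation during training. An explicit evaluation call made after training finishes is separate and would still run.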

shibing624 commented 1 year ago

Refer to https://github.com/mymusise/ChatGLM-Tuning/issues/179 and https://github.com/shibing624/MedicalGPT/issues/125

shibing624 / MedicalGPT

Multi-GPU chatglm2 SFT fails with RuntimeError: expected scalar type Half but found Float #60