There is a solution here, and the problem has been resolved. https://discuss.huggingface.co/t/getting-torch-cuda-halftensor-error-while-using-deepspeed-with-accelerate/39997/6
@lintao-common, thanks for sharing the solution. Closing this issue.
Originally posted by @NickyMouseSG in https://github.com/microsoft/DeepSpeed/issues/550#issuecomment-1722239501
I'm trying to test DPO training of the llama3-8b model on the Bridges-2 platform with 16 V100-32GB GPUs, which don't support bf16, so I set `fp16: true` to use mixed precision. The code is based on alignment-handbook.
While using deepspeed 0.12.2 with fp16 mixed precision, the model's `input_ids` get cast to `torch.float16`, although they should be Int or Long. The DeepSpeed ZeRO-3 config file:
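(Sketch of the relevant part: a ZeRO-3 config with fp16 enabled, in the style of alignment-handbook's recipes; values are illustrative, not necessarily the exact file from my run.)

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```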
The traceback:
I've added three print statements before the model receives `input_ids`, in trl's dpo_trainer.py:
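(Equivalent standalone version of the probe, for reproducibility; this is a sketch, not the exact lines patched into dpo_trainer.py:)

```python
import torch

def log_input_dtype(model):
    # Wrap model.forward so the dtype of input_ids is printed before
    # every call. Debugging helper only; not part of trl or transformers.
    orig_forward = model.forward

    def wrapped(*args, **kwargs):
        ids = kwargs.get("input_ids", args[0] if args else None)
        if isinstance(ids, torch.Tensor):
            print("input_ids dtype:", ids.dtype)  # expect torch.int64
        return orig_forward(*args, **kwargs)

    model.forward = wrapped
    return model
```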
and in transformers' modeling_llama.py:
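(Why the dtype matters at this layer: LlamaModel feeds `input_ids` straight into an `nn.Embedding` lookup, which only accepts integer indices, so a float16 tensor fails there. A minimal standalone illustration, with arbitrary sizes:)

```python
import torch
import torch.nn as nn

# stand-in for the model's token embedding table (sizes are arbitrary)
embed = nn.Embedding(num_embeddings=32000, embedding_dim=16)

ids = torch.tensor([[1, 2, 3]])        # int64 indices: works
print(embed(ids).shape)                # torch.Size([1, 3, 16])

bad_ids = ids.to(torch.float16)        # what the fp16 cast produces
try:
    embed(bad_ids)
except RuntimeError as e:
    print(e)                           # embedding lookup rejects float indices
```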
and I got:
When I add `input_ids = input_ids.long()` before the model receives it, the `input_ids` dtype error goes away, but instead I get another bug:
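(The workaround amounts to defensively restoring integer dtype on index-like tensors before the forward pass. A sketch; the helper name is mine and the batch keys assume the usual transformers layout:)

```python
import torch

def ensure_integer_ids(batch):
    # Restore integer dtype on id/mask tensors that a mixed-precision
    # wrapper may have cast to float16. Key names are an assumption
    # based on the usual transformers batch layout.
    for key in ("input_ids", "labels", "attention_mask"):
        if key in batch and torch.is_floating_point(batch[key]):
            batch[key] = batch[key].long()
    return batch
```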
I've tested several different versions of trl and transformers and got the same issues, so I think the bug is caused by deepspeed.
Thanks!