Hello, could you please help me understand this: when I follow the approach in this issue, multi-GPU training runs correctly on a V100 machine, but when I run the same code on a machine with four 3090 GPUs, I encounter the error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!` #31
Oops, sorry for that. According to the experience of other researchers who have used our code, you may also need to set `device_map='auto'` when loading the model at https://github.com/rui-ye/OpenFedLLM/blob/427aec52f068860a835244563dd4f9b48bf06f00/main_sft.py#L34
Originally posted by @rui-ye in https://github.com/rui-ye/OpenFedLLM/issues/21#issuecomment-2176527114
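For reference, here is a minimal sketch of what that change could look like. It assumes the model is loaded with `transformers`' `AutoModelForCausalLM.from_pretrained`; the model name and dtype below are placeholders rather than the exact values used in `main_sft.py`, and only the `device_map='auto'` argument is the suggested fix.

```python
# Sketch: load the base model with device_map='auto' so that HuggingFace
# Accelerate places the layers across all visible GPUs, avoiding the
# "tensors on cuda:0 and cuda:3" mismatch during multi-GPU training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; use your own base model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # let accelerate shard the model across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```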