rui-ye / OpenFedLLM


Does the framework support multi-GPU training? #21

Open bihaizhang opened 5 months ago

bihaizhang commented 5 months ago

Thanks for your brilliant work. I would like to do SFT with multiple GPUs. Does your framework support this feature by design, or do I need to make some modifications?

rui-ye commented 5 months ago

Hi, thanks! Yes, for example, setting CUDA_VISIBLE_DEVICES=0,1 will run your code on devices 0 and 1.
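Concretely, the variable can be set on the command line (e.g. `CUDA_VISIBLE_DEVICES=0,1 python main_sft.py`) or inside the script before CUDA is initialized. Below is a minimal sketch of the in-script variant (the command-line arguments to main_sft.py are omitted here):

```python
import os

# Restrict this process to GPUs 0 and 1. The variable must be set before CUDA
# is initialized (setting it before importing torch is the safe pattern);
# otherwise it is ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

print(torch.cuda.device_count())  # should report 2
```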

bihaizhang commented 5 months ago

Thanks for your reply! After setting CUDA_VISIBLE_DEVICES=0,1, the code still runs on only one GPU. Is there any other modification required?

rui-ye commented 5 months ago

Oops, sorry about that. According to the experience of other researchers who have used our code, you may also need to set device_map='auto' for the model loading at https://github.com/rui-ye/OpenFedLLM/blob/427aec52f068860a835244563dd4f9b48bf06f00/main_sft.py#L34
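For reference, a minimal sketch of what that change looks like; the actual call at that line of main_sft.py may pass additional arguments (quantization config, dtype, etc.), and `model_name_or_path` below is only a placeholder:

```python
from transformers import AutoModelForCausalLM

model_name_or_path = "meta-llama/Llama-2-7b-hf"  # placeholder; use the model from your own config

# device_map="auto" lets Accelerate spread the layers across all visible GPUs
# instead of placing the whole model on a single device.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
)

print(model.hf_device_map)  # shows which layers landed on which GPU
```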

imamtom commented 3 months ago

> Oops, sorry about that. According to the experience of other researchers who have used our code, you may also need to set device_map='auto' for the model loading at https://github.com/rui-ye/OpenFedLLM/blob/427aec52f068860a835244563dd4f9b48bf06f00/main_sft.py#L34

It works, thanks!

wizaaaard commented 1 month ago

> Oops, sorry about that. According to the experience of other researchers who have used our code, you may also need to set device_map='auto' for the model loading at https://github.com/rui-ye/OpenFedLLM/blob/427aec52f068860a835244563dd4f9b48bf06f00/main_sft.py#L34
>
> It works, thanks!

Hello, could you please help me understand this? Following the approach in this issue, multi-GPU training runs correctly on a V100 machine, but when I run the same code on a machine with four 3090 GPUs, I encounter the error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!