Hi @Ragul-Ramdass -- thank you for reporting this issue and the one in #3783. Please give us a few business days to look into it and get back to you (I left a similar message in the above-mentioned issue as well). Thank you.
Hi @alexsherstinsky, thanks for looking into it. Please let me know if you need any other information. My goal is distributed training with DeepSpeed in Ludwig; if you can suggest any workaround, that would also be great. Thanks!
Hi, I'm trying to run distributed training of llama-7b on a VM with two Tesla T4 GPUs using native DeepSpeed, and I'm hitting the following error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!`
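This error class generally means that model weights and input tensors ended up on different GPUs within one process, rather than each DeepSpeed rank owning a single device. A minimal, self-contained illustration of the failure mode (not Ludwig's internal code path; it requires a machine with two CUDA devices):

```python
import torch

# A layer pinned to cuda:0 receiving an input tensor that lives on cuda:1
# reproduces the same error message reported above.
layer = torch.nn.Linear(8, 8).to("cuda:0")
x = torch.randn(2, 8, device="cuda:1")
layer(x)  # RuntimeError: Expected all tensors to be on the same device, ...
```

Under a DeepSpeed launch this typically surfaces when something (e.g. an automatic device map) spreads the model across both GPUs while the launcher expects one process per GPU.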
Environment:
- OS: Ubuntu 20.04
- Python: 3.10.13

model.yaml:
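(The actual config did not come through in this thread. Below is a minimal sketch of the kind of Ludwig LLM fine-tuning config this setup usually involves; the `base_model`, feature names, ZeRO stage, and the Ray-backend DeepSpeed strategy shown here are assumptions for illustration, not the reporter's actual values.)

```yaml
# Hypothetical model.yaml sketch -- not the reporter's actual config.
model_type: llm
base_model: huggyllama/llama-7b  # assumption: any LLaMA-7B checkpoint id

input_features:
  - name: instruction   # assumption: dataset column names
    type: text

output_features:
  - name: output
    type: text

trainer:
  type: finetune
  batch_size: 1          # small batch to fit 7B params on 16 GB T4s

backend:
  type: ray              # Ludwig's Ray backend with its DeepSpeed strategy
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3         # ZeRO-3 shards params/grads/optimizer state across GPUs
```

If launching with the native deepspeed CLI instead of the Ray backend, the command is typically of the form `deepspeed --no_python --num_gpus 2 ludwig train --config model.yaml --dataset <data>` (these are standard DeepSpeed launcher flags; the exact invocation may differ).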
Can you guide me in solving this? Thanks in advance!