williamyuanv0 opened this issue 2 years ago
kengz replied:

hey @Jzhou0 SLM Lab wasn't written with distributed training across GPUs in mind. However I think you could do so with:

- a custom net class that handles multi-GPU placement internally, using the `net_spec` for GPU assignment as you need
- registering it and setting `"type": "YourConvNet"` in the spec file, with the net spec values

And the algorithm should just be able to pick it up. Depending on the algorithm, the loss computation may use data from different devices, so you'd need to make sure the correct device transfer happens in your net class implementation. But again, certain things might break when you're training something this big across devices, so definitely watch out for that. Let me know how it goes!
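By way of illustration, a minimal sketch of that suggestion. It assumes the `ConvNet` constructor signature `(net_spec, in_dim, out_dim)` and a `conv_model` submodule attribute (check both against your checkout); the class name `YourConvNet` and the `gpu_ids` spec key are made up for the example:

```python
# Hedged sketch: YourConvNet, the gpu_ids key, and the conv_model attribute
# are assumptions for illustration, not SLM Lab's confirmed API.
import torch.nn as nn
from slm_lab.agent.net.conv import ConvNet  # assumed module path

class YourConvNet(ConvNet):
    '''ConvNet variant whose conv trunk is replicated across GPUs.'''

    def __init__(self, net_spec, in_dim, out_dim):
        super().__init__(net_spec, in_dim, out_dim)
        # read the GPU list from the net spec, e.g. "gpu_ids": [0, 1, 2, 3]
        gpu_ids = net_spec.get('gpu_ids')  # hypothetical spec key
        if gpu_ids:
            # DataParallel splits each batch across the listed devices in forward()
            self.conv_model = nn.DataParallel(self.conv_model, device_ids=gpu_ids)
```

If the class is registered wherever SLM Lab resolves net types, `"type": "YourConvNet"` in the spec file should construct it as usual.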
Hi kengz, I've run into a problem with running on multiple GPUs. In the `__init__` of class `ConvNet` in conv.py, the code assigns the device as follows:

```python
self.to(self.device)
```

How can this be extended to multiple GPUs, either inside `__init__` of `ConvNet` or on an instance of `ConvNet`? When I try to use `torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)` to assign the net to multiple GPUs, the problem is that some (public) methods and attributes defined on `ConvNet` are no longer accessible after `conv_model = torch.nn.DataParallel(conv_model, device_ids=[1, 2, 3, 4])`.
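For illustration, a minimal self-contained sketch of that attribute problem in plain PyTorch (`TinyConvNet` and `custom_method` are made-up names, not SLM Lab code): `DataParallel` only proxies `nn.Module` machinery such as parameters and submodules, so methods defined on the wrapped class have to be reached through `.module`.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    '''Stand-in for ConvNet with one "public" helper method.'''

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)

    def custom_method(self):
        return 'still reachable'

model = TinyConvNet()
# use whatever GPUs exist; fall back to DataParallel's default on CPU-only boxes
device_ids = list(range(torch.cuda.device_count())) or None  # e.g. [1, 2, 3, 4]
wrapped = nn.DataParallel(model, device_ids=device_ids)

# wrapped.custom_method()             # AttributeError: hidden behind the wrapper
print(wrapped.module.custom_method())  # the original net lives at .module
```

So one workaround is to keep a reference to the unwrapped net, or go through `wrapped.module`, wherever the algorithm code calls `ConvNet`'s own methods.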