thomasjpfan opened 3 years ago
Thanks for reporting. Do you know how this is solved more generally (say, only using PyTorch without any frameworks)? I could imagine that similar errors can occur easily, given how tricky multi-threading is in general. Unfortunately, I don't have access to a setup to experiment with this.
I do not have a setup to experiment with this either. I've seen two solutions.
BTW there are a bunch of barrier calls in this file to handle the distributed case.
Since we do not have the resources to test DDP, I think it would be hard to officially support it.
I know too little to really comment on that. Ideally, I would wish for skorch to get out of the way enough that users can use `DistributedDataParallel` if they wish to. Regarding barriers, is that something that could be achieved through callbacks?
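As a rough sketch of the callback idea: the snippet below is a hypothetical synchronization callback, written as a plain class exposing the `on_epoch_end(net, **kwargs)` method that skorch callbacks implement (in practice it would subclass `skorch.callbacks.Callback`; skorch is omitted here to keep the example self-contained). The class name and the decision to synchronize at epoch end are assumptions, not skorch API.

```python
import torch.distributed as dist


class DistributedBarrier:
    """Hypothetical skorch-style callback that makes all DDP workers
    wait for each other at the end of every epoch."""

    def initialize(self):
        # skorch calls initialize() on callbacks before training
        return self

    def on_epoch_end(self, net, **kwargs):
        # Only synchronize when a process group is actually running;
        # outside DDP this callback is a no-op.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()
```

Outside a distributed run the callback does nothing, so the same code path works for single-process training.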
I had a recent conversation with a user who tried to use `DistributedDataParallel` with skorch's early stopping, and this would cause the process to hang. My guess is that since DDP workers spawn their own jobs, skorch's early stopping mechanism would stop a worker, but the parent node would not get this information. This leaves the parent waiting for a child that has stopped running.

There may also be an issue with checking the validation loss with `DistributedDataParallel`, because each worker would have its own loss, and this would need to be gathered to actually compute the loss for a given epoch.
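To illustrate the gathering step, here is a minimal sketch of how per-worker validation losses could be combined with `torch.distributed.all_reduce` so every rank sees the same epoch-level value. The helper name and its arguments are hypothetical; only the `all_reduce` call is real PyTorch API.

```python
import torch
import torch.distributed as dist


def epoch_valid_loss(local_loss_sum, local_count):
    """Hypothetical helper: combine per-worker validation losses into
    one global mean so early stopping sees the same number on every rank."""
    stats = torch.tensor([local_loss_sum, float(local_count)])
    if dist.is_available() and dist.is_initialized():
        # Sum the loss totals and sample counts across all workers, then
        # divide once, so the result is a true global mean rather than a
        # mean of per-worker means (which would be wrong for uneven shards).
        dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return (stats[0] / stats[1]).item()
```

Because every rank ends up with the same loss, an early-stopping decision based on it is made identically everywhere, which avoids the hang where one worker stops while the others keep waiting.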