Thanks for your great work. I'd like to reproduce the training process, but I encountered an error. That is when I use multi-GPU distributed training process, the logging information seems normal, but afterwards the remote server stuck and connection reset and finally the process is killed. My remote server is an independent machine with 4xRTX3090. Is there any issues with the pytorch lightning distributed training that may cause my failure?
Thanks for your great work. I'd like to reproduce the training process, but I encountered an error. That is when I use multi-GPU distributed training process, the logging information seems normal, but afterwards the remote server stuck and connection reset and finally the process is killed. My remote server is an independent machine with 4xRTX3090. Is there any issues with the pytorch lightning distributed training that may cause my failure?