Pytorch Lightning stuck the computer and finally killed

wayveai / fiery

PyTorch code for the paper "FIERY: Future Instance Segmentation in Bird's-Eye view from Surround Monocular Cameras"

https://wayve.ai/blog/fiery-future-instance-prediction-birds-eye-view

MIT License


Pytorch Lightning stuck the computer and finally killed #40

Open synsin0 opened 2 years ago

synsin0 commented 2 years ago

Thanks for your great work. I'd like to reproduce the training process, but I ran into a problem. When I use multi-GPU distributed training, the logging output looks normal at first, but then the remote server hangs, the connection is reset, and the process is finally killed. My remote server is a standalone machine with 4x RTX 3090. Is there an issue with PyTorch Lightning distributed training that could cause this failure?

pranavi77 commented 2 years ago

It might be using too much RAM. Check your RAM usage once you start running the code, and reduce num_workers in the config file. This might solve your issue.
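To make the RAM suggestion concrete: with distributed data-parallel training, each GPU gets its own process, and each process starts its own DataLoader workers, so the worker count grows with the number of GPUs. Here is a rough sketch of that arithmetic, not anything taken from the FIERY code. The function names and the per-worker memory figure are made up for illustration, and the real per-worker footprint depends on the dataset and has to be measured.

```python
from typing import Optional


def total_dataloader_workers(num_gpus: int, num_workers: int) -> int:
    """Worker processes for one DataLoader across all GPU processes.

    Assumes one training process per GPU, each starting `num_workers`
    DataLoader workers (the usual DDP setup).
    """
    return num_gpus * num_workers


def estimated_worker_ram_gib(num_gpus: int, num_workers: int,
                             gib_per_worker: float) -> float:
    """Very rough host-RAM estimate for the workers alone.

    `gib_per_worker` is a hypothetical number you measure yourself,
    for example by watching one worker process in `top`.
    """
    return total_dataloader_workers(num_gpus, num_workers) * gib_per_worker


def mem_available_gib(meminfo_text: str) -> Optional[float]:
    """Parse `MemAvailable` (reported in kB) from /proc/meminfo-style text.

    Returns None if the field is missing.
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / 1024 ** 2
    return None
```

On a Linux server you could call `mem_available_gib(open("/proc/meminfo").read())` before and during training and compare it with the estimate. For example, 4 GPUs with 5 workers each means 20 worker processes for the training loader alone. If the available memory drops towards zero before the machine hangs, lowering num_workers is the first thing to try.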