keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; Pytorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License
1.42k stars 82 forks

torch.distributed.elastic.multiprocessing.api.SignalException #19

Closed (shuuchen closed 1 year ago)

shuuchen commented 1 year ago

Hi,

Thanks for your great work!

I trained with your code but always got the above exception after training for 1 or 2 hours. After searching, I found that it happens when the terminal window is closed, even though the model was training in the background with nohup.

Have you run into the same problem? Also, how did you train the models in the background?

Thank you!

keyu-tian commented 1 year ago

Thanks!

Sorry, but I have no idea what causes this SignalException. We use tmux (which serves a similar purpose to nohup) to train the models in the background. Maybe you could try it too.
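For readers who cannot use tmux, here is a minimal Python sketch of another way to keep a job alive after the terminal closes. It assumes the exception is triggered by a SIGHUP delivered when the terminal goes away, which this thread does not confirm. The helper name `launch_detached` and the log path are hypothetical; only `subprocess.Popen` and its `start_new_session` flag are real standard-library features (POSIX only).

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in its own session, writing stdout and stderr to `log_path`.

    Hypothetical helper, not part of SparK or PyTorch.
    Assumption: the job dies from a hangup signal sent when the terminal closes.
    """
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,    # do not read from the terminal
            stdout=log_file,             # keep output after the terminal is gone
            stderr=subprocess.STDOUT,
            start_new_session=True,      # child calls setsid(): no controlling terminal
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent can close it.
        log_file.close()
```

Calling `launch_detached([...], "train.log")` with the usual training command as a list of arguments returns a `Popen` object whose `pid` can be recorded for later monitoring. Unlike tmux, this gives no session to reattach to, so progress has to be followed through the log file.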

shuuchen commented 1 year ago

Thank you. I tried tmux and it worked!