mit-han-lab / data-efficient-gans

[NeurIPS 2020] Differentiable Augmentation for Data-Efficient GAN Training
https://arxiv.org/abs/2006.10738
BSD 2-Clause "Simplified" License

ERROR: terminate called after throwing an instance of 'std::system_error' #80

Closed thusinh1969 closed 3 years ago

thusinh1969 commented 3 years ago

Hi,

I suddenly ran into this error: terminate called after throwing an instance of 'std::system_error' what(): open(/home/nguyen/tmp/tmp_gaq61e3/.torch_distributed_init): No such file or directory

Everything had worked fine up to that point: training reached 5 kimg without problems and then suddenly hit this error. Nothing has changed in the environment, and the original StyleGAN2-ADA runs fine in the same conda environment.


tick 6 kimg 24.0 time 25m 40s sec/tick 147.6 sec/kimg 36.90 maintenance 0.3 cpumem 5.44 gpumem 5.18 augment 0.000
terminate called after throwing an instance of 'std::system_error'
  what(): open(/home/nguyen/tmp/tmp_gaq61e3/.torch_distributed_init): No such file or directory
tick 7 kimg 28.0 time 28m 08s sec/tick 148.3 sec/kimg 37.08 maintenance 0.3 cpumem 5.44 gpumem 5.18 augment 0.000
tick 8 kimg 32.0 time 30m 37s sec/tick 148.7 sec/kimg 37.18 maintenance 0.3 cpumem 5.44 gpumem 5.18 augment 0.000


The weird thing is that training somehow keeps going without stopping. I am just worried that this may lead to problems later.
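One way to narrow this down is to check whether the directory and file named in the error message still exist while training is running. The sketch below is only a diagnostic aid. It assumes, without confirmation from this thread, that the path is a rendezvous file created inside a temporary directory that was removed while the process was still alive. `describe_rendezvous_path` is a hypothetical helper, not part of this repository or of PyTorch.

```python
import os
import tempfile


def describe_rendezvous_path(init_file):
    """Report whether a file and its parent directory currently exist.

    Hypothetical helper for illustration only.
    """
    parent = os.path.dirname(init_file)
    return {
        "dir_exists": os.path.isdir(parent),
        "file_exists": os.path.isfile(init_file),
    }


# Path copied verbatim from the error message in this issue.
reported = "/home/nguyen/tmp/tmp_gaq61e3/.torch_distributed_init"
print("reported path:", describe_rendezvous_path(reported))

# Self-contained demonstration of the assumed failure mode: a file that
# lives in a temporary directory disappears once that directory is removed.
with tempfile.TemporaryDirectory() as temp_dir:
    demo_file = os.path.join(temp_dir, ".torch_distributed_init")
    open(demo_file, "w").close()  # create an empty placeholder file
    inside = describe_rendezvous_path(demo_file)

# The context manager has now deleted temp_dir and everything in it.
after = describe_rendezvous_path(demo_file)

print("while the directory exists:", inside)
print("after the directory is removed:", after)
```

If the reported directory is missing while training is still in progress, something outside the training script (for example a periodic temp-file cleanup job) may have deleted it. That is a guess to verify, not a confirmed cause.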

System:

Any ideas?

Steve

zsyzzsoft commented 3 years ago

Haven't seen this error...

TimRieber commented 2 years ago

Did you solve this? I am encountering the same issue right now.