taesungp / swapping-autoencoder-pytorch

Official Implementation of Swapping Autoencoder for Deep Image Manipulation (NeurIPS 2020)

Training freezes at the same iteration with no error #30

Closed kaaeaate closed 2 years ago

kaaeaate commented 2 years ago

Hi! I'm trying to run your code on the Church dataset, with batch_size=16 and num_gpus=4, using the command:

python -m experiments church train church_default

Training freezes at iteration 888000 with the following message:

(iters: 888000, data: 0.000, train: 0.050, maintenance: 0.000) D_R1: 0.089 D_mix: 0.292 D_real: 0.597 D_rec: 0.290 D_total: 2.495 G_GAN_mix: 0.934 G_GAN_rec: 0.467 G_L1: 0.211 G_mix: 0.805 L1_dist: 0.211 PatchD_mix: 0.652 PatchD_real: 0.659

Training does not progress past this iteration; it simply freezes with no error. I have also tried training on the Bedrooms dataset with batch_size=32 and num_gpus=8, as well as single-GPU training on both datasets. In every case the run froze at the same iteration, 888000, with the same behavior. No checkpoints or snapshots are saved after this iteration either, which confirms that the script has stopped training.
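In case it helps, here is a minimal sketch of how I plan to capture stack traces the next time the run hangs, using Python's standard faulthandler module (added at the start of the training script; the signal choice and timeout are arbitrary, not part of this repo):

```python
import faulthandler
import signal

# Dump the tracebacks of all threads to stderr when the process receives
# SIGUSR1, so a hung run can be inspected from another shell with
# `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# As a fallback, dump tracebacks automatically if the process is still
# alive after an hour (repeating hourly), to catch silent deadlocks.
faulthandler.dump_traceback_later(3600, repeat=True)
```

Alternatively, `py-spy dump --pid <pid>` should give the same information without modifying the code.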

What could be the reason for this behavior? Thank you in advance.