johannbrehmer / manifold-flow

Manifold-learning flows (ℳ-flows)
https://arxiv.org/abs/2003.13913
MIT License

NaN error #12

Open zaocan666 opened 2 years ago

zaocan666 commented 2 years ago

Hi, excellent work here. I encountered a NaN error when training with the config configs/train_mf_gan64d_april.config:

Traceback (most recent call last):
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 592, in <module>
    learning_curves = train_model(args, dataset, model, simulator)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 504, in train_model
    learning_curves = train_manifold_flow_sequential(args, dataset, model, simulator)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 276, in train_manifold_flow_sequential
    learning_curves = trainer1.train(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 307, in train
    loss_train, loss_val, loss_contributions_train, loss_contributions_val = self.epoch(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 380, in epoch
    batch_loss, batch_loss_contributions = self.batch_train(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 513, in batch_train
    loss_contributions = self.forward_pass(batch_data, loss_functions, forward_kwargs=forward_kwargs, custom_kwargs=custom_kwargs)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 633, in forward_pass
    self._check_for_nans("Reconstructed data", x_reco)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 122, in _check_for_nans
    raise NanException
training.trainer.NanException

I am using 5 GPUs and PyTorch 1.7.1. Have you ever encountered such a problem?
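
For readers hitting the same exception: the traceback ends in a NaN check on the reconstructed data. A minimal sketch of what such a check might look like, assuming PyTorch tensors, is given below. The function `check_for_nans` and this `NanException` class are hypothetical stand-ins written for illustration, not the repository's actual code.

```python
import torch


class NanException(Exception):
    """Hypothetical stand-in for the exception seen in the traceback."""


def check_for_nans(label, *tensors):
    """Raise NanException if any given tensor contains NaN or inf values."""
    for tensor in tensors:
        if tensor is None:
            continue
        finite = torch.isfinite(tensor)
        if not finite.all():
            n_bad = int((~finite).sum())
            raise NanException(f"{label} contains {n_bad} non-finite value(s)")


# A clean tensor passes silently.
x_ok = torch.tensor([0.5, 1.0])
check_for_nans("Reconstructed data", x_ok)

# A tensor with a NaN entry triggers the exception.
x_bad = torch.tensor([0.5, float("nan")])
try:
    check_for_nans("Reconstructed data", x_bad)
except NanException as err:
    message = str(err)
```

Calling a check like this on intermediate tensors, not only on the final reconstruction, can help narrow down where the first non-finite value appears.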

zaocan666 commented 2 years ago

I found that this occurs only when I train on multiple GPUs, but I do not know why.

Seven-year-promise commented 2 years ago

I trained with the MSE loss on all parameters instead of only a subset of them (when training celeba_emf), and it somehow works. Decreasing the learning rate also helps.
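
The two workarounds above can be sketched roughly as follows, assuming a standard PyTorch optimizer. The toy model, the choice of Adam, and the learning rate values are illustrative assumptions, not the settings used in the repository.

```python
import torch
from torch import nn

# Toy stand-in for the flow; the layers and sizes are hypothetical.
model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 8))

# Hand every parameter to the optimizer rather than a subset,
# and use a smaller learning rate than before (values are illustrative).
base_lr = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=0.1 * base_lr)

x = torch.randn(16, 8)
x_reco = model(x)
loss = nn.functional.mse_loss(x_reco, x)  # reconstruction MSE

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Confirm that the optimizer covers all model parameters.
n_optimized = sum(p.numel() for group in optimizer.param_groups for p in group["params"])
n_total = sum(p.numel() for p in model.parameters())
```

Comparing `n_optimized` with `n_total` is a quick way to confirm that no parameters were left out of the optimizer by accident.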