First, thank you for sharing the code. I have run into a problem in train_AEI.py during training:
lossD: nan lossG: nan batch_time: 1.4538311958312988s
L_adv: nan L_id: 1.0066417455673218 L_attr: 0.1159457415342331 L_rec: 0.5412249565124512
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Traceback (most recent call last):
File "train_AEI.py", line 160, in <module>
scaled_loss.backward()
File "/miniconda/miniconda3/envs/pytorch11.3/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/miniconda/miniconda3/envs/pytorch11.3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/miniconda/miniconda3/envs/pytorch11.3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/miniconda/miniconda3/envs/pytorch11.3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/_process_optimizer.py", line 131, in post_backward_models_are_masters
scaler.unscale_with_stashed(
File "/miniconda/miniconda3/envs/pytorch11.3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/scaler.py", line 183, in unscale_with_stashed
out_scale/grads_have_scale,
ZeroDivisionError: float division by zero
May I ask what causes this? Looking forward to your reply.
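From the log it looks like L_adv goes NaN first, apex then keeps halving the dynamic loss scale until it reaches 0, and the unscale step divides by that zero scale. One workaround I am considering is skipping the optimizer step whenever a loss is non-finite, before it can drag the scale down. A minimal sketch in plain Python (the `should_skip_step` helper and the sample values are my own illustration, not code from train_AEI.py):

```python
import math

def should_skip_step(*loss_values):
    """Return True if any loss is NaN or inf, so this training step can be skipped."""
    return not all(math.isfinite(v) for v in loss_values)

# The NaN L_adv seen in the log above would trigger a skip:
print(should_skip_step(float('nan'), 1.0066, 0.1159, 0.5412))  # True
# With all losses finite, the step proceeds normally:
print(should_skip_step(0.2500, 1.0066, 0.1159, 0.5412))        # False
```

In the real training loop the guard would go just before `scaled_loss.backward()`, using `loss.item()` for each term, but I am not sure whether skipping is the right fix or whether the NaN itself points to a deeper issue (e.g. learning rate or the discriminator loss).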