ShoufaChen opened this issue 2 years ago
Hi Shoufa,
Could you please send the exact command you are running for training?
This is indeed a NaN during the forward pass (hence losses are NaN), which looks like a divergence.
On Tue, Jul 12, 2022 at 12:34 AM Shoufa Chen wrote:
Hi, @unixpickle https://github.com/unixpickle
Thanks for your awesome work and open source.
I met the NaN issue when training on ImageNet 128x128:
----------------------------
| lg_loss_scale | -1.62e+04 |
| loss | nan |
| loss_q0 | nan |
| loss_q1 | nan |
| loss_q2 | nan |
| loss_q3 | nan |
| mse | nan |
| mse_q0 | nan |
| mse_q1 | nan |
| mse_q2 | nan |
| mse_q3 | nan |
| samples | 3.92e+07 |
| step | 1.53e+05 |
| vb | nan |
| vb_q0 | nan |
| vb_q1 | nan |
| vb_q2 | nan |
| vb_q3 | nan |
----------------------------
Found NaN, decreased lg_loss_scale to -16199.354
Found NaN, decreased lg_loss_scale to -16200.354
Found NaN, decreased lg_loss_scale to -16201.354
Found NaN, decreased lg_loss_scale to -16202.354
Found NaN, decreased lg_loss_scale to -16203.354
I used fp16. Did you meet similar issues?
Thanks in advance.
Hi, @unixpickle
Thanks for your help.
My command:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"
OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
--nproc_per_node=8 --nnodes=4 --node_rank=$1 \
--master_addr=$CHIEF_IP --master_port=22268 \
--use_env scripts/image_train.py \
--data_dir /dev/shm/imagenet/train \
$MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
I use 4 nodes, each of which has 8 GPUs.
Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening?
Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?
Perhaps this bug is related to the issue here: https://github.com/openai/guided-diffusion/issues/44
If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line
self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))
to something like this:
for p in self.master_params:
p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))
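For reference, here is a minimal, self-contained sketch of that patched unscaling step; this is an illustrative reconstruction under my own assumptions (a standalone helper, not the exact code in fp16_util.py):

def unscale_master_grads(master_params, lg_loss_scale):
    # Undo the loss scaling on every master parameter's gradient,
    # not only on master_params[0] as in the original line.
    inv_scale = 1.0 / (2 ** lg_loss_scale)
    for p in master_params:
        if p.grad is not None:
            p.grad.mul_(inv_scale)

If master_params ever holds more than one tensor, unscaling only the first one would leave the remaining gradients multiplied by 2 ** lg_loss_scale, which can easily overflow the optimizer update.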
Thanks for your help.
I will patch this bug and try again. I will post my results in about 2 days.
The problem of NaNs still exists with this change.
I am now at 430180 steps and haven't hit any NaNs.
That's strange. I'm training a 256x256 model with batch size 256 and learning rate 1e-4 on 8 nodes. You say that you didn't meet NaNs. Do you mean that you no longer meet NaNs at all, or that you no longer hit the problem of lg_loss_scale decreasing indefinitely even when a NaN does occur?
After applying the change, my training log still looks like the original one. My run is resumed from a partly trained model with about 300k iterations. During training I hit NaNs every few thousand iterations, and in most cases decreasing lg_loss_scale recovers from them. But training still eventually fails after about 10-20k iterations of repeatedly decreasing lg_loss_scale, and I have to stop and resume a new run from the last healthy checkpoint.
I am training a 128*128 ImageNet model.
Me too: I still get NaN losses and the training fails when training on ImageNet 64x64.
@ShoufaChen Hi, have you fully solved this issue? Do you still get any NaN losses or training failures?
Hello! I also had this problem. Did you solve it? In my case the program still runs, so maybe the loss is not broken yet, but it keeps printing "Found NaN".
----------------------------
| lg_loss_scale | -909 |
| loss | 0.115 |
| loss_q0 | 0.261 |
| loss_q1 | 0.0599 |
| loss_q2 | 0.0339 |
| loss_q3 | 0.0241 |
| mse | 0.111 |
| mse_q0 | 0.25 |
| mse_q1 | 0.0594 |
| mse_q2 | 0.0336 |
| mse_q3 | 0.0237 |
| samples | 1.98e+03 |
| step | 990 |
| vb | 0.00385 |
| vb_q0 | 0.0104 |
| vb_q1 | 0.00048 |
| vb_q2 | 0.00031 |
| vb_q3 | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944
Looking forward to your reply.
@JawnHoan My solution is to re-clone the whole repo and implement your own method...
I know it is not a good idea, but it works for me.
Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" happens but training doesn't stop.
I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how that NaN is handled. However, if the program keeps finding NaNs, it means that decreasing lg_loss_scale is not able to fix the problem.
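As a rough illustration of that behaviour, here is a simplified sketch of one dynamic loss-scaling update (my own reconstruction, not the exact code in fp16_util.py; the scale_growth hyperparameter is an assumption, chosen to match the small per-step increases of lg_loss_scale visible in the logs in this thread):

import torch

def fp16_update(opt, master_params, lg_loss_scale, scale_growth=1e-3):
    # Assumes the loss was multiplied by 2 ** lg_loss_scale before backward().
    grads = [p.grad for p in master_params if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        # Overflow/NaN in the scaled gradients: skip the update and shrink the
        # scale. This is what "Found NaN, decreased lg_loss_scale" reports.
        opt.zero_grad()
        return lg_loss_scale - 1
    # Gradients are finite: undo the scaling, take the step, and let the scale
    # slowly grow back so fp16 precision is not permanently reduced.
    for g in grads:
        g.mul_(1.0 / (2 ** lg_loss_scale))
    opt.step()
    opt.zero_grad()
    return lg_loss_scale + scale_growth

So the message by itself is expected recovery behaviour; the real problem is when lg_loss_scale keeps falling without bound (for example down to around -1.6e+04 as in the first post), which means the forward or backward pass is producing NaNs no matter how small the scale gets.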
Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the training?
----------------------------
| grad_norm | 0.144 |
| lg_loss_scale | 23.3 |
| loss | 0.185 |
| loss_q0 | 0.285 |
| loss_q1 | 0.0296 |
| loss_q2 | 0.0139 |
| loss_q3 | 0.44 |
| mse | 0.0367 |
| mse_q0 | 0.147 |
| mse_q1 | 0.029 |
| mse_q2 | 0.0136 |
| mse_q3 | 0.00291 |
| param_norm | 303 |
| samples | 2.62e+04 |
| step | 3.27e+03 |
| vb | 0.148 |
| vb_q0 | 0.138 |
| vb_q1 | 0.000615 |
| vb_q2 | 0.000278 |
| vb_q3 | 0.437 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm | 0.13 |
| lg_loss_scale | 23.6 |
| loss | 0.0725 |
| loss_q0 | 0.205 |
| loss_q1 | 0.0294 |
| loss_q2 | 0.0108 |
| loss_q3 | 0.00471 |
| mse | 0.0481 |
| mse_q0 | 0.127 |
| mse_q1 | 0.0288 |
| mse_q2 | 0.0105 |
| mse_q3 | 0.00452 |
| param_norm | 307 |
| samples | 3.71e+04 |
| step | 4.64e+03 |
| vb | 0.0245 |
| vb_q0 | 0.0776 |
| vb_q1 | 0.00059 |
| vb_q2 | 0.00021 |
| vb_q3 | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...
Hi, how did you achieve multi-node, multi-GPU training with the torch.distributed.launch command you posted earlier? Did you change the code? I tried multi-node, multi-GPU training with another program but it failed because of slow communication between the nodes. Did you notice this? Could you share some experience with multi-node, multi-GPU training?
@fido20160817 it is normal, no worries about it
Thanks!🤝
@JawnHoan Hi, if you still have this issue, I suggest you decrease the learning rate.
In my experiments on ImageNet 64x64 with batch size 128, lr=1e-4 caused this NaN issue. After changing the learning rate from 1e-4 to 3e-5, the problem was solved. Hope this helps.
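Illustratively, and only as a restatement of the suggestion above in the flag convention used earlier in this thread, that change would look like TRAIN_FLAGS="--lr 3e-5 --batch_size 128" instead of --lr 1e-4 (with batch_size adjusted to your own setup).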
@forever208 Hello, may I add your contact information to ask some questions? Thank you.
Hi @ONobody, of course, my email: m.ning@uu.nl
Hello! May I ask what loss the model you trained converges to? I trained on my own dataset, but the generated images are all noise; I can't make out any content in them at all.
About 0.055. ImageNet is the most time-consuming dataset to train on; I suggest you first try the CIFAR-10 or LSUN datasets.
Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely easy way to dramatically improve the FID and training speed based on guided-diffusion, feel free to take a look.
Thanks a lot.