openai / guided-diffusion

nan occurs when training ImageNet 128x128 #50

Open ShoufaChen opened 2 years ago

ShoufaChen commented 2 years ago

Hi, @unixpickle

Thanks for your awesome work and open source.

I ran into a NaN issue when training on ImageNet 128x128:

-----------------------------
| lg_loss_scale | -1.62e+04 |
| loss          | nan       |
| loss_q0       | nan       |
| loss_q1       | nan       |
| loss_q2       | nan       |
| loss_q3       | nan       |
| mse           | nan       |
| mse_q0        | nan       |
| mse_q1        | nan       |
| mse_q2        | nan       |
| mse_q3        | nan       |
| samples       | 3.92e+07  |
| step          | 1.53e+05  |
| vb            | nan       |
| vb_q0         | nan       |
| vb_q1         | nan       |
| vb_q2         | nan       |
| vb_q3         | nan       |
-----------------------------
Found NaN, decreased lg_loss_scale to -16199.354
Found NaN, decreased lg_loss_scale to -16200.354
Found NaN, decreased lg_loss_scale to -16201.354
Found NaN, decreased lg_loss_scale to -16202.354
Found NaN, decreased lg_loss_scale to -16203.354

I used fp16. Did you encounter similar issues?

Thanks in advance.

unixpickle commented 2 years ago

Hi Shoufa,

Could you please send the exact command you are running for training?

This is indeed a NaN during the forward pass (hence losses are NaN), which looks like a divergence.
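
If it helps narrow this down, here is a minimal debugging sketch (plain PyTorch, not code from this repo; install_nan_hooks is just an illustrative name) that reports modules whose forward output goes non-finite, so the first one printed localizes where the divergence starts:

import torch

def install_nan_hooks(model):
    # Attach a forward hook to every submodule; any module whose output
    # contains NaN/Inf is reported, and the first report in the log shows
    # where the forward pass starts to diverge.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output in module: {name}")
                    break
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))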

ShoufaChen commented 2 years ago

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.

unixpickle commented 2 years ago

Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening?

Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?

unixpickle commented 2 years ago

Perhaps this bug is related to the issue here: https://github.com/openai/guided-diffusion/issues/44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))
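
For context, this is roughly what the surrounding dynamic loss-scaling step looks like with the patched loop applied to all master params (a minimal standalone sketch with illustrative names, not the repo's exact code):

import torch

def unscale_and_step(master_params, opt, lg_loss_scale, scale_growth=1e-3):
    # If any gradient is NaN/Inf, the scaled backward pass overflowed:
    # skip this update and lower the loss scale for the next step.
    if any(not torch.isfinite(p.grad).all() for p in master_params):
        lg_loss_scale -= 1
        print(f"Found NaN, decreased lg_loss_scale to {lg_loss_scale}")
        return lg_loss_scale

    # The patched loop: undo the 2 ** lg_loss_scale factor on every
    # master parameter, not just the first one, then take the step.
    for p in master_params:
        p.grad.mul_(1.0 / (2 ** lg_loss_scale))
    opt.step()

    # Grow the scale slowly again while training stays stable.
    return lg_loss_scale + scale_growth
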
ShoufaChen commented 2 years ago

Thanks for your help.

I will patch this bug and try again. I will post my results in about 2 days.

realPasu commented 2 years ago

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))
ShoufaChen commented 2 years ago

I am now at 430,180 steps and have not encountered any NaNs.

realPasu commented 2 years ago

That's strange. I'm training a 256*256 model with batch size 256 and learning rate 1e-4 on 8 nodes. When you say you didn't meet NaNs, do you mean you no longer see NaNs at all, or just that you no longer hit the infinitely decreasing lg_loss_scale when a NaN does occur? After applying the change, my training log still looks much like before. My run is resumed from a partially trained model at about 300k iterations. During training I hit NaNs every few thousand iterations; most of the time decreasing lg_loss_scale recovers, but after roughly 10-20k iterations of this the training finally fails, and I have to stop and resume a new training process from the last good checkpoint.

ShoufaChen commented 2 years ago

I am training a 128*128 ImageNet model.

forever208 commented 2 years ago

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44 If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

Same here: I still get NaN losses and training fails when training on ImageNet 64*64.

forever208 commented 2 years ago

Hi @ShoufaChen, have you fully resolved this issue? Have you seen any NaN losses or training failures since?

HoJ-Onle commented 2 years ago

Hello! I also have this problem. Did you solve it? In my case the program keeps running and the loss is not broken yet, but it keeps reporting "Found NaN".

----------------------------
| lg_loss_scale | -909     |
| loss          | 0.115    |
| loss_q0       | 0.261    |
| loss_q1       | 0.0599   |
| loss_q2       | 0.0339   |
| loss_q3       | 0.0241   |
| mse           | 0.111    |
| mse_q0        | 0.25     |
| mse_q1        | 0.0594   |
| mse_q2        | 0.0336   |
| mse_q3        | 0.0237   |
| samples       | 1.98e+03 |
| step          | 990      |
| vb            | 0.00385  |
| vb_q0         | 0.0104   |
| vb_q1         | 0.00048  |
| vb_q2         | 0.00031  |
| vb_q3         | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944

Looking forward to your reply.

forever208 commented 2 years ago

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of it...

I know it is not a good solution, but it works for me.

HoJ-Onle commented 2 years ago

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of it...

I know it is not a good solution, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening while training does not stop.

ZGCTroy commented 2 years ago

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of it... I know it is not a good solution, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening while training does not stop.

I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how the trainer deals with them. However, if the program keeps finding NaNs, it means that decreasing lg_loss_scale is no longer able to fix the problem.
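
A small illustration of why the scale matters at all (plain PyTorch, independent of this repo): gradient magnitudes that fp32 handles fine can underflow to zero in fp16 unless they are scaled up before backward and unscaled again in fp32:

import torch

# float16 flushes values below roughly 6e-8 to zero, so tiny gradients
# vanish unless the loss (and hence the gradients) is scaled up first.
g = torch.tensor(1e-8)
print(g.half())                                # tensor(0., dtype=torch.float16)

# Scale up before the fp16 cast, unscale afterwards in fp32: the value survives.
print((g * 2 ** 20).half().float() / 2 ** 20)  # ~1e-08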

fido20160817 commented 2 years ago

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...
fido20160817 commented 2 years ago

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.

Hi, how did you set up multi-node, multi-GPU training? Did you have to change the code? I tried multi-node, multi-GPU training with another program but failed because of slow communication between nodes. Did you notice this? Could you share some experience with multi-node, multi-GPU training?

forever208 commented 2 years ago

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?


@fido20160817 It is normal, no need to worry about it.

fido20160817 commented 2 years ago

Thanks!🤝

forever208 commented 2 years ago

Hi @JawnHoan, if you still have this issue, I suggest decreasing the learning rate.

In my experiments with batch=128 on ImageNet 64, lr=1e-4 caused this NaN issue. After changing the learning rate from 1e-4 to 3e-5, the problem was solved. Hope this helps.
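
Concretely, using the TRAIN_FLAGS convention from earlier in this thread, the change looks like the line below (as I understand it, --batch_size in the training script is per process, so the global batch is batch_size times the number of GPUs):

TRAIN_FLAGS="--lr 3e-5 --batch_size 8"   # lr lowered from 1e-4; keep your usual per-GPU batch size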

ONobody commented 1 year ago

Hello @forever208, may I have your contact information so I can ask some questions? Thank you.

forever208 commented 1 year ago

Hi @ONobody, of course, my email: m.ning@uu.nl

hxy-123-coder commented 1 year ago

Hello! May I ask what loss value your trained model converges to? I trained on my own dataset, but the generated images are all noise; I can't see any content in them at all.

forever208 commented 1 year ago

Hello! May I ask what loss value your trained model converges to? I trained on my own dataset, but the generated images are all noise; I can't see any content in them at all.

About 0.055. ImageNet is the most time-consuming dataset to train on; I suggest you first try CIFAR-10 or the LSUN datasets.

Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely simple way to dramatically improve FID and training speed on top of guided-diffusion; feel free to take a look.

hxy-123-coder commented 1 year ago

Thanks a lot.