lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

Find an error when training decoder #135

Closed YUHANG-Ma closed 2 years ago

YUHANG-Ma commented 2 years ago

Hi, thanks for your code! However, I think I found a bug in train_decoder.py. At line 130, the generator cannot generate images at size 64 and an error is raised (traceback screenshot attached). I changed the code to

generated_images = [transforms.functional.resize(i, [128,128]) for i in samples]

and it works now. I think this may be a bug in the original code when the input image size is resized to 128?
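
For context, this is roughly where the change sits relative to make_grid in my copy of the validation step (samples is the list of sampled tensors and real_images the list of 128x128 tensors from the dataloader; names may not match the repo exactly):

import torch
from torchvision import transforms
from torchvision.utils import make_grid

# samples come out of the decoder at 64x64 while the real images are 128x128,
# so the generated images are upsampled before everything is stacked into one grid
generated_images = [transforms.functional.resize(i, [128, 128]) for i in samples]
grid = make_grid(torch.stack(generated_images + real_images), nrow=len(samples))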

Also, I ran into an error during training. I am training the decoder without text conditioning; my dataset is made of {}.tar shards plus embedding.npy files. Should I set the text condition to false when doing the validation, like this (screenshot attached)? I also found that the loss becomes nan during training, and I don't know why. Looking forward to your reply.

Veldrovive commented 2 years ago

The dataloader code is very inflexible at the moment since we are still just trying to get it to work. It assumes certain things about the dataset such as there being text captions since it is built with LAION 2B in mind. I believe your workaround should be fine, but I will gather some of the issues people are having in a later pull request.

The sample generation handles any square image, but the config file can be a bit difficult to handle. It appears that you are loading your data as 128x128 and then generating 64x64 images which would cause an error. I would guess that you need to change the preprocessing config to have a resize down to 64x64 like this

"RandomResizedCrop": {
  "size": [64, 64],
  "scale": [0.75, 1.0],
  "ratio": [1.0, 1.0]
},

Or it could be the other way around. If you have set the unets to generate 128x128 images, you also need to change the dataloader to load 128x128 images by setting "size": [128, 128]. If that isn't the issue, I would need your config file to debug further.
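
If it helps, my understanding is that those preprocessing entries map onto the torchvision transforms of the same name, so the config above corresponds roughly to this (illustrative, not the exact dataloader code):

from torchvision import transforms

# crop/resize everything down to the 64x64 that the first unet expects
preprocess = transforms.Compose([
    transforms.RandomResizedCrop(size=(64, 64), scale=(0.75, 1.0), ratio=(1.0, 1.0)),
    transforms.ToTensor(),
])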

As for the nan loss, that can be caused by a few things. The most basic is the possibility that the data has an error in it. If you have checked that, there are some other causes we have found but haven't had time to explore and fix. There appears to be a strange issue with smaller decoders where a divide-by-zero happens shortly after training starts, which causes the loss to become nan. A learning rate above 5e-5 also appears to cause a similar issue.
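
If you want to rule out bad data quickly, a check along these lines will catch corrupted batches (just a sketch, assuming the loader yields (img, emb) pairs):

import torch

# scan a handful of batches for non-finite values before blaming the decoder itself
for step, (img, emb) in enumerate(dataloader):
    if not torch.isfinite(img).all() or not torch.isfinite(emb).all():
        print(f"non-finite values found in batch {step}")
        break
    if step >= 100:
        break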

YUHANG-Ma commented 2 years ago

Thanks for your reply. I have fixed the first issue. For the nan loss, I am now trying lr = 1e-5 instead of 1e-3 and waiting for the result. I also have another question: what is the difference between training the decoder with text embeddings and without them? If I set "with_text" to false, an error is raised (screenshot attached), and I also had to delete 'txt' manually as shown in the screenshot.

Will text embeddings affect the training of the decoder? If so, why doesn't the repo add text embeddings as input during data processing? I only see text embeddings used in the validation part.

Veldrovive commented 2 years ago

Ah, I see. I thought I had made that more flexible. That is just for captioning the image, since text embeddings were found not to have much effect with DALL-E 2, although Imagen calls that into question. To make it work, you can simply remove that txt and just leave a tuple decomposition of img and emb. The txt isn't actually used in the validation loop anymore, so that should work.
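
In code terms the change is only in the tuple decomposition, roughly like this (names are from memory, and run_validation_step is just a stand-in for whatever the loop actually does with the batch):

for batch in dataloader:
    img, emb = batch                  # previously: img, emb, txt = batch
    run_validation_step(img, emb)     # txt is not needed downstream anymore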

YUHANG-Ma commented 2 years ago

Thanks! That makes a lot of sense! The model is training now; I will see if the nan loss still occurs and will keep updating on the progress. Have a nice day :)

lucidrains commented 2 years ago

I think the Decoder should be able to auto-resize any images to the proper dimensions during training. The issue seems to be that the validation loop is calling make_grid on the original images (128) vs the sampled ones (64), but perhaps we need to intercept the make_grid call and just conform all images to the smallest dimensions before they are stacked into a grid.
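
Something like this is what I have in mind, just as a sketch rather than the actual code in the repo:

import torch
import torchvision.transforms.functional as TF
from torchvision.utils import make_grid

def make_conformed_grid(images, **kwargs):
    # resize every image to the smallest height/width present so that
    # 128x128 real images and 64x64 samples can share one grid
    min_h = min(img.shape[-2] for img in images)
    min_w = min(img.shape[-1] for img in images)
    images = [TF.resize(img, [min_h, min_w]) for img in images]
    return make_grid(torch.stack(images), **kwargs)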

Veldrovive commented 2 years ago

I think that makes sense. There are cases in which the image should be loaded in at a different dimension than the final unet size so it would not make sense to enforce that the dataloader resizes to the final unet size. There might be other cases in the future where something breaks because of this type of behavior but I think it has to be fixed on a case by case basis.

YUHANG-Ma commented 2 years ago

Do you mean resizing the generated images to fit the original ones, or the opposite? What I have done is resize the images to 128 to match the input image size when building generated_images, and then pass them to the make_grid function.

Veldrovive commented 2 years ago

It is more representative of the actual generated image to resize the image from the dataloader rather than the one sampled from the decoder, so I would resize real_images to match generated_images.
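
In other words, something along these lines in the validation step (sketch only, variable names assumed):

import torchvision.transforms.functional as TF

# shrink the dataloader images to the sampled resolution instead of upsampling the samples
target_size = list(generated_images[0].shape[-2:])
real_images = [TF.resize(img, target_size) for img in real_images]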

lucidrains commented 2 years ago

@YUHANG-Ma let me know if this helps with the error you see above https://github.com/lucidrains/DALLE2-pytorch/commit/9025345e2984b76f5641b6347e2f32a068121cde

lucidrains commented 2 years ago

@YUHANG-Ma also put in a fix for your other issue here https://github.com/lucidrains/DALLE2-pytorch/commit/f8bfd3493af8881cc2e1c4402a516ecfd26c0e55 (feel free to close the issue if both are resolved)

YUHANG-Ma commented 2 years ago

Hi, I still run into the issue of the training loss becoming nan (screenshot attached). It is not nan at the beginning of training, but it changes to nan during the second epoch. I don't think it is caused by an error in my dataset. I changed the lr to 1e-5, but it still doesn't work.

Veldrovive commented 2 years ago

I opened an issue with some details about where the nan is coming from: #138. I've got to go to sleep now though, so I can't look further into it.

lucidrains commented 2 years ago

@YUHANG-Ma I would just turn off learned variance (learned_variance=False); I don't think it is important

lucidrains commented 2 years ago

Another paper also says that setting beta2 on Adam down to 0.99 has a stabilizing effect
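
For reference, that just means changing the optimizer's betas, e.g. (stand-in model, not the repo's trainer code):

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the decoder being trained
# default betas are (0.9, 0.999); the suggestion is to lower beta2 to 0.99
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))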

lucidrains commented 2 years ago

@YUHANG-Ma try setting this to True (https://github.com/lucidrains/DALLE2-pytorch/commit/ffd342e9d06acf3d28165609fe927e2f6be6498a#diff-038ecede954c29266888cb88e37cc06f61ba2433f8b5142c7b2b2cefde5ed0edR1748). If that doesn't work, just turn off learned variance and see if that helps

Veldrovive commented 2 years ago

I've tried all of the options laid out, and the only one that seems to solve the problem consistently is turning off learned variance. The other solutions seem to reduce the probability of it happening, but some of the seeds I am testing still hit the problem. Setting learned_variance_constrain_frac = true in the unet config makes it much less likely, but it still happens occasionally.

lucidrains commented 2 years ago

@Veldrovive thanks for confirming that learned variance is the problem!

lucidrains commented 2 years ago

@Veldrovive it really should make little difference to turn it off, based on what we know

The size of the text encoder (plus dynamic thresholding) makes the most difference for text/image alignment, as Imagen figured out, and the cascading DDPM does most of the work of making the image look nice. Whatever contribution learned variance had would only apply to the non-cascading works of the past.
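
For anyone unfamiliar, dynamic thresholding from the Imagen paper is roughly this (a sketch of the idea, not the exact code in this repo):

import torch

def dynamic_threshold(x0, percentile=0.995):
    # clamp the predicted x0 to a per-sample range set by a percentile of |x0|,
    # then rescale back so values stay within [-1, 1]
    s = torch.quantile(x0.flatten(1).abs(), percentile, dim=1)
    s = s.clamp(min=1.0).view(-1, 1, 1, 1)
    return torch.maximum(torch.minimum(x0, s), -s) / s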

lucidrains commented 2 years ago

@YUHANG-Ma I also decided to lower beta2 for Adam down to 0.99, since that's what the authors of https://openreview.net/forum?id=2LdBqxc1Yv claim helps with stable training. Still, DDPMs are only a two-year-old technology at this point, so findings from papers can end up being superstitions; we can quickly test it and move on if it does not work.

lucidrains commented 2 years ago

(screenshot from 2022-06-03 attached)

YUHANG-Ma commented 2 years ago

I am changing learned variance to false to see if it works, and I will keep updating. beta2 is already set to 0.99 now. Many thanks!

lucidrains commented 2 years ago

@YUHANG-Ma working for you now?