nerdyrodent / VQGAN-CLIP

Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.

"nan" losses issue for some small subset of users #151

Closed monsieurpooh closed 2 years ago

monsieurpooh commented 2 years ago

Example last few iterations of a user for whom it's not working:

47it [01:22,  1.60s/it]
48it [01:24,  1.70s/it]
49it [01:26,  1.67s/it]
50it [01:27,  1.64s/it]

50it [01:28,  1.64s/it]
i: 0, loss: nan, losses: nan
i: 50, loss: nan, losses: nan

Example last few iterations for my PC:

[e] 48it [00:16,  2.77it/s]
[e] 49it [00:16,  2.80it/s]
[e] 50it [00:16,  2.94it/s]
[e]                        
[e] 50it [00:17,  2.94it/s]
i: 0, loss: 0.92412, losses: 0.92412
i: 50, loss: 0.765271, losses: 0.765271

I have no clue how to even begin debugging this

monsieurpooh commented 2 years ago

We confirmed that on his computer, running the exact same code as mine, the loss is always "nan", whereas mine is a normal number. I suspect some localization issue or environment assumption deep inside the Python code or its libraries. Does anyone have any idea how to begin debugging this? It doesn't repro on my machine.
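One rough way to start bisecting this (report_nan below is just a throwaway debugging helper, not part of the repo, and the variable names are assumed from the VQGAN+CLIP generate script) would be to check the intermediate tensors of the training step for nan/inf:

```python
import torch

def report_nan(name, tensor):
    # Throwaway debugging helper: report whether a tensor contains nan/inf
    # and its value range, to see where the numbers go bad.
    t = tensor.detach().float()
    print(f"{name}: nan={torch.isnan(t).any().item()} "
          f"inf={torch.isinf(t).any().item()} "
          f"min={t.min().item():.4g} max={t.max().item():.4g}")

# Sprinkle calls like these through the training step
# (variable names assumed from the generate script):
# report_nan("out", out)
# report_nan("loss", loss)
```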

monsieurpooh commented 2 years ago

btw, the reason this is an issue is that on the machines where the program fails, the output never becomes any more image-like than the seed image. It stays a blotchy, seed-like image no matter how many iterations are run.

monsieurpooh commented 2 years ago

Update: We've narrowed down the problem to something that occurs on the line

iii = perceptor.encode_image(normalize(make_cutouts(out))).float()

"out" variable is regular numbers but "iii" variable is all "nan". Will update more after adding more debugging statements and having him run the debugging again to narrow it down further.

monsieurpooh commented 2 years ago

Something happens inside "encode_image". I dug into CLIP/clip/model.py and added debugging statements. Inside the "forward" method of VisionTransformer there is a series of transformations of the variable "x". x contains "nan" after the line self.conv1(x). Then, magically, it no longer contains "nan" after the line x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1). Then, for some inconceivable reason, it contains nan again after the line self.transformer(x). I must reiterate that this is only reproducible on the other user's machine; I can't reproduce it on my end. On my machine (and most people's machines), it never contains nan.

This is very hard to debug; I beg anyone with knowledge of this system to chime in.
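For anyone trying to repeat this digging without hand-editing CLIP/clip/model.py, forward hooks on the visual tower do roughly the same thing (add_nan_hooks is just a throwaway helper; perceptor is the loaded CLIP model from the generate script):

```python
import torch

def add_nan_hooks(model):
    # Throwaway helper: print every submodule whose output contains nan.
    # The first name printed during a forward pass is where nan is introduced.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"nan in output of {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage (perceptor is the loaded CLIP model):
# add_nan_hooks(perceptor.visual)
# iii = perceptor.encode_image(normalize(make_cutouts(out))).float()
```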

monsieurpooh commented 2 years ago

I've now narrowed it down to _conv_forward in torch/nn/modules/conv.py. The line of code is:

F.conv2d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups)

If bias is None, the returned tensor contains nan for the users who hit this bug. It doesn't happen for everyone; only about 1% of users are affected.

If bias is not None, the output contains no nan on either machine.
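A minimal standalone check for this (my own sketch, with shapes roughly matching CLIP ViT-B/32's bias-free conv1 running in half precision on the GPU):

```python
import torch
import torch.nn.functional as F

# Bias-free fp16 convolution on the GPU, roughly mirroring CLIP's
# VisionTransformer conv1 (ViT-B/32: 3 -> 768 channels, 32x32 patches).
x = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)
w = torch.randn(768, 3, 32, 32, device="cuda", dtype=torch.float16) * 0.02

y = F.conv2d(x, w, bias=None, stride=32)

print("cuDNN:", torch.backends.cudnn.version())
print("output contains nan:", torch.isnan(y).any().item())
```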

Possibly related is https://github.com/pytorch/pytorch/issues/59439

monsieurpooh commented 2 years ago

More updates: calling conv2d with a non-None bias didn't solve the issue. Neither did updating to PyTorch 1.11.0.

monsieurpooh commented 2 years ago

Update: It might be related to:

- https://github.com/pytorch/pytorch/issues/58123
- https://github.com/openai/glide-text2im/issues/31
- https://discuss.pytorch.org/t/half-precision-convolution-cause-nan-in-forward-pass/117358/3
- https://github.com/pytorch/pytorch/issues/69449
- https://github.com/ultralytics/yolov5/issues/5815

monsieurpooh commented 2 years ago

A possible fix might be to update to a cuDNN version newer than 8.2.2. Note that, by default, even the absolute latest PyTorch build ships with something like cuDNN 8.2, so after installing PyTorch you need to download cuDNN separately and patch in the newer DLL files. Will comment later with updates.
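To verify which cuDNN build PyTorch is actually picking up before and after swapping the DLLs, a quick check like this should do (the version is reported as an integer, e.g. 8202 for 8.2.2):

```python
import torch

# Report the PyTorch / CUDA / cuDNN versions actually loaded at runtime.
print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("cuDNN available:", torch.backends.cudnn.is_available())
```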

monsieurpooh commented 2 years ago

Updating to a cuDNN version newer than 8.2.2 fixed the issue!

nerdyrodent commented 2 years ago

Glad you got it sorted!