Closed · monsieurpooh closed this issue 2 years ago
We confirmed that on his computer, running the exact same code as mine, the loss always reads "NaN", whereas mine is a normal number. I suspect a localization issue/assumption somewhere deep in the Python code or its libraries. Does anyone have any idea how to begin debugging this? It doesn't reproduce on my machine.
btw, the reason this is an issue is that on the machines where the program fails, the image never becomes any more image-like than the seed image. It stays a blotchy, seed-like image no matter how many iterations are run.
Update: We've narrowed down the problem to something that occurs on the line
iii = perceptor.encode_image(normalize(make_cutouts(out))).float()
The "out" variable contains regular numbers, but the "iii" variable is all "NaN". Will update more after adding more debugging statements and having him run them again to narrow it down further.
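For anyone else bisecting a pipeline like this, a minimal sketch of the check I'm inserting between steps (the `report_nan` helper and the stand-in tensors are mine, not part of the original code):

```python
import torch

def report_nan(name, t):
    # Count NaN entries so the first bad tensor in the pipeline is obvious.
    n = torch.isnan(t).sum().item()
    print(f"{name}: shape={tuple(t.shape)}, nan_count={n}")
    return n

out = torch.randn(1, 3, 224, 224)   # stand-in for the image batch fed to CLIP
report_nan("out", out)              # 0 on a healthy run
bad = out.clone()
bad[0, 0, 0, 0] = float("nan")
report_nan("iii (simulated)", bad)  # even one NaN is enough to poison the loss
```

Dropping a call like this after each stage (`make_cutouts`, `normalize`, `encode_image`) pins down which stage first produces NaN.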
Something happens inside "encode_image". I dug into CLIP/clip/model.py and added debugging statements. Inside the "forward" method of VisionTransformer there is a series of transformations of the variable "x". x contains "NaN" after the line self.conv1(x)
. Then, magically, it no longer has "NaN" after the line x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)
. Then, for some inconceivable reason, it contains NaN again after the line self.transformer(x)
. I must reiterate that this is only reproducible on the other user's machine; I can't reproduce it on my end. On my machine (and most people's machines), x never contains NaN.
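Instead of hand-editing model.py, a forward hook on every submodule can flag the first module whose output contains NaN in a single pass. This is a sketch with a toy model standing in for the real one (on the actual repro you'd pass `perceptor.visual` instead); `add_nan_hooks` is my own helper name:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model):
    # Register a forward hook on every submodule and collect the names of
    # modules whose output contains NaN, in execution order.
    hits = []
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            hits.append(type(module).__name__)
    for m in model.modules():
        m.register_forward_hook(hook)
    return hits

# Tiny stand-in model; wrap the real CLIP visual tower the same way.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
hits = add_nan_hooks(model)
model(torch.randn(1, 4))
print(hits)  # empty on a healthy run; first entry names the culprit otherwise
```

The first name in `hits` tells you which layer introduced the NaN without touching library source.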
This is very hard to debug; I beg anyone with knowledge of this system to chime in.
I've now narrowed it down to _conv_forward in torch/nn/modules/conv.py. The line of code is:
F.conv2d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups)
If bias is None, the returned tensor contains NaN for the users who suffer from this bug. This doesn't happen for everyone; it only affects about 1% of users.
If bias is not None, there is no NaN for either group of users.
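A self-contained way to test the two cases, mirroring the failing call with toy shapes (the helper and shapes are mine; on an affected machine you would run it with `dtype=torch.float16, device="cuda"` to match the real conditions):

```python
import torch
import torch.nn.functional as F

def conv_has_nan(use_bias, dtype=torch.float32, device="cpu"):
    # Same F.conv2d call shape as _conv_forward, with or without a bias tensor.
    x = torch.randn(1, 3, 8, 8, dtype=dtype, device=device)
    w = torch.randn(4, 3, 3, 3, dtype=dtype, device=device)
    b = torch.zeros(4, dtype=dtype, device=device) if use_bias else None
    y = F.conv2d(x, w, b, stride=1, padding=1, dilation=1, groups=1)
    return bool(torch.isnan(y).any())

# On CPU/float32 both should print False; on an affected GPU the bias=None
# case was the one reported to produce NaN.
print(conv_has_nan(use_bias=False), conv_has_nan(use_bias=True))
```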
Possibly related is https://github.com/pytorch/pytorch/issues/59439
More updates: calling conv2d with a bias set didn't solve the issue after all. Neither did updating to PyTorch 1.11.0.
Update: It might be related to:
- https://github.com/pytorch/pytorch/issues/58123
- https://github.com/openai/glide-text2im/issues/31
- https://discuss.pytorch.org/t/half-precision-convolution-cause-nan-in-forward-pass/117358/3
- https://github.com/pytorch/pytorch/issues/69449
- https://github.com/ultralytics/yolov5/issues/5815
A possible fix might be to update cuDNN above 8.2.2. Note that by default even the absolute latest PyTorch binaries bundle something like cuDNN 8.2, so after installing PyTorch, download cuDNN separately and patch in the DLL files. Will comment later with updates.
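To check which cuDNN build PyTorch is actually using before and after patching the DLLs, a quick sketch (the `8202` threshold encodes 8.2.2 in cuDNN 8.x's integer versioning scheme; the comparison logic is my own):

```python
import torch

# torch.backends.cudnn.version() returns an int like 8202 for cuDNN 8.2.2,
# or None on CPU-only builds.
print("torch:", torch.__version__)
print("cuDNN:", torch.backends.cudnn.version())
if torch.backends.cudnn.is_available() and (torch.backends.cudnn.version() or 0) <= 8202:
    print("cuDNN <= 8.2.2: consider patching in newer cuDNN DLLs")
```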
Updating above cuDNN 8.2.2 fixed the issue!
Glad you got it sorted!
Example last few iterations of a user for whom it's not working: (screenshot not included)
Example last few iterations for my PC: (screenshot not included)
I have no clue how to even begin debugging this