mattans opened 5 years ago
This is a repr of the image, if that helps:
tensor([[[1.0000, 1.0000, 1.0000, ..., 0.1882, 0.1647, 0.1451],
         [1.0000, 1.0000, 1.0000, ..., 0.1725, 0.1451, 0.1412],
         [1.0000, 1.0000, 1.0000, ..., 0.1373, 0.1255, 0.1255],
         ...,
         [0.3647, 0.3608, 0.3608, ..., 0.3059, 0.3098, 0.3137],
         [0.3608, 0.3608, 0.3608, ..., 0.2980, 0.3059, 0.3098],
         [0.3608, 0.3725, 0.3765, ..., 0.2902, 0.3020, 0.3137]],

        [[1.0000, 1.0000, 1.0000, ..., 0.1882, 0.1686, 0.1529],
         [1.0000, 1.0000, 1.0000, ..., 0.1725, 0.1490, 0.1490],
         [1.0000, 1.0000, 1.0000, ..., 0.1490, 0.1412, 0.1412],
         ...,
         [0.3725, 0.3686, 0.3725, ..., 0.3216, 0.3333, 0.3333],
         [0.3725, 0.3725, 0.3686, ..., 0.3176, 0.3255, 0.3216],
         [0.3725, 0.3843, 0.3843, ..., 0.3137, 0.3216, 0.3255]],

        [[1.0000, 1.0000, 1.0000, ..., 0.2196, 0.1765, 0.1412],
         [1.0000, 1.0000, 1.0000, ..., 0.2039, 0.1569, 0.1451],
         [1.0000, 1.0000, 1.0000, ..., 0.1765, 0.1451, 0.1529],
         ...,
         [0.5216, 0.5176, 0.5098, ..., 0.4275, 0.4353, 0.4471],
         [0.5098, 0.5098, 0.5176, ..., 0.4353, 0.4392, 0.4431],
         [0.4941, 0.5216, 0.5373, ..., 0.4392, 0.4471, 0.4549]]],
       device='cuda:0')
Hi,
This is weird, as it indicates that the RPN isn't producing any proposals (which I'm not sure should happen). Do you have a repro that I can use?
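For anyone debugging this, one way to confirm whether the RPN is emitting zero proposals is a forward hook. This is a minimal sketch, assuming a torchvision FasterRCNN; the hook function and the random input are illustrative, not code from this issue:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=2).eval()

# RegionProposalNetwork.forward returns (proposals, losses);
# print how many proposals the RPN emits per image.
def count_proposals(module, inputs, outputs):
    proposals, _losses = outputs
    print([p.shape[0] for p in proposals])

model.rpn.register_forward_hook(count_proposals)
with torch.no_grad():
    model([torch.rand(3, 512, 512)])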
I have encountered this issue too. https://github.com/pytorch/vision/blob/d2c763e14efe57e4bf3ebf916ec243ce8ce3315c/torchvision/models/detection/generalized_rcnn.py#L66-L72 Specifically, during evaluation of the model, for some training examples, the call to self.backbone on line 67 (which in my case is an FPN) returns a feature pyramid of all NaNs; this seems to be what is causing the problem.
@arm-buster if your FPN returns all NaNs, it means that your model or your input is corrupted. My guess is that your input data is corrupted, because if it were a problem with the model, training would have aborted before this error happened. Can you double-check that all your inputs are well-formed?
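As an aside, such a sanity check is cheap to add before the forward pass. A minimal sketch, assuming `images` is a list of float image tensors; the helper name is hypothetical:

import torch

# Hypothetical helper: fail early if any input image contains NaN or Inf,
# instead of letting the error surface deep inside the RPN / ROI heads.
def assert_images_finite(images):
    for i, img in enumerate(images):
        if not torch.isfinite(img).all():
            raise ValueError(f"image {i} contains NaN/Inf values")

assert_images_finite([torch.rand(3, 512, 512)])  # passes silently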
I met the same error:
File "generalized_rcnn.py", line 64, in forward
targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "roi_heads.py", line 329, in forward
image_shapes)
File "roi_heads.py", line 225, in postprocess_detections
pred_boxes = self.box_coder.decode(box_regression, proposals)
File "_utils.py", line 189, in decode
rel_codes.reshape(box_sum, -1), concat_boxes
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
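The error itself is easy to reproduce in isolation: reshape cannot infer a -1 dimension when the tensor has zero elements, while fully explicit shapes are fine. A standalone illustration, not code from this repo:

import torch

t = torch.empty(0, 8)
t.reshape(0, 2, 4)  # OK: every dimension is explicit
t.reshape(0, -1)    # RuntimeError: -1 is ambiguous for a 0-element tensor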
@fmassa I have a question. What do you mean by "corrupted"? I manually set some parts of my image to 0; is this what you mean by "corrupted"? I also notice that in @mattans's reply there are areas of 1. Many thanks!
@liyiwei979621500 by corrupted I mean that some elements of your input are NaN. That's my current best guess.
@fmassa I have solved my problem. I guess it was caused by a broken model pkl file.
@liyiwei979621500 thanks for the information.
I'm closing this issue as I suppose it's due to some corrupted data (either inputs or weights). @mattans if you manage to provide a minimal working example that reproduces the error, please let us know.
Hi @fmassa, here's a minimal working example. Obviously this initialization is purposely poor, but it would be nice if the inference code didn't crash. Note that none of the weights or inputs are NaN, so this could in principle happen by chance.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

torch.manual_seed(21)
model = fasterrcnn_resnet50_fpn(num_classes=2).cuda().eval()

# Overwrite every weight with random noise: a pathological but NaN-free model.
state_dict = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
model.load_state_dict(state_dict)

with torch.no_grad():
    model([torch.rand((3, 512, 512)).cuda()])

>>> RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...
The problem is with BoxCoder.decode. Here's my attempt at a fix, which seems to work for me (unchanged code omitted):
def decode(self, rel_codes, boxes):
    ...
    assert rel_codes.size(0) == box_sum
    pred_boxes = self.decode_single(rel_codes, concat_boxes)
    # Compute the last two dimensions explicitly instead of using -1,
    # so the reshape stays well-defined even when box_sum is 0.
    deltas_per_box = rel_codes.size(-1) // 4
    return pred_boxes.reshape(box_sum, deltas_per_box, 4)
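The key point is that a reshape with all-explicit sizes is well-defined for empty tensors; only the inferred -1 dimension is ambiguous when there are zero elements. Spelling out deltas_per_box therefore sidesteps the crash without changing behavior for non-empty inputs.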
I can submit a PR if you think it appropriate.
@jhultman please, if you could send a PR (with a test case) it would be great! The test case just needs to cover BoxCoder.encode and BoxCoder.decode, no need to test the whole model.
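For reference, a sketch of what such a test might look like. The expected output shape assumes the fix above, and the weights and shapes are illustrative, not the actual test from any merged PR:

import torch
from torchvision.models.detection._utils import BoxCoder

def test_decode_handles_empty_proposals():
    coder = BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))
    rel_codes = torch.zeros(0, 4)   # no regression deltas
    boxes = [torch.zeros(0, 4)]     # one image, zero proposals
    pred = coder.decode(rel_codes, boxes)
    assert pred.shape == (0, 1, 4)  # must not raise the reshape error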
Sometimes it is an exploding-gradient problem, where the model outputs very large values (> 10**20) that eventually overflow to Inf/NaN. In that case you must retrain your model from the beginning and try a lower learning rate or gradient clipping, e.g.:

loss.backward()
# Clip the gradient norm before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()
I met the same problem on torchvision 0.5.0. I was wondering if there is any progress on the fix? @fmassa @jhultman
@arm-buster thanks for pointing out where the problem is! I decreased my learning rate from 0.002 to 0.0002 and that seems to have fixed my problem.
If you encounter exploding gradients even with lower learning rates, give Adagrad a try with a non-standard learning rate of 0.001. In general, training that way may not be OK, since the RPN is usually trained with a higher learning rate than the box/classifier heads; a sketch of the per-group setup follows below.
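Per-module learning rates can be set up with optimizer parameter groups. The module-name prefixes below match torchvision's FasterRCNN (rpn, roi_heads), but the specific rates are illustrative assumptions, not recommended values:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=2)

# Split parameters so the RPN can use a different (here: higher)
# learning rate than the ROI/box heads and the backbone.
rpn_params = [p for n, p in model.named_parameters() if n.startswith("rpn.")]
head_params = [p for n, p in model.named_parameters() if n.startswith("roi_heads.")]
rest_params = [p for n, p in model.named_parameters()
               if not n.startswith(("rpn.", "roi_heads."))]

optimizer = torch.optim.SGD(
    [
        {"params": rpn_params, "lr": 1e-3},   # illustrative: higher LR for RPN
        {"params": head_params, "lr": 1e-4},
        {"params": rest_params, "lr": 1e-4},
    ],
    lr=1e-4,
    momentum=0.9,
)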
Is there a definitive solution to this problem? It's happening for me when using FasterRCNN too, on torchvision '0.4.0+cu92'. It appears that during a forward pass rel_codes is empty, which crashes the reshape operation.