pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

MaskRCNN crashes when reshaping an empty tensor rel_codes #1568

Open mattans opened 5 years ago

mattans commented 5 years ago

torchvision '0.4.0+cu92'

Traceback:

creating index...
index created!
Traceback (most recent call last):
  File "./scratch_19.py", line 1068, in <module>
    main()
  File "./scratch_19.py", line 1054, in main
    evaluate(model, data_loader_test, device=device)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "./scratch_19.py", line 889, in evaluate
    outputs = model(image)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 52, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 550, in forward
    boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 474, in postprocess_detections
    pred_boxes = self.box_coder.decode(box_regression, proposals)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/_utils.py", line 168, in decode
    rel_codes.reshape(sum(boxes_per_image), -1), concat_boxes
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

It appears that during a forward pass, rel_codes is empty, which crashes the reshape operator.
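
For reference, the failure can be reproduced in isolation (a minimal sketch, independent of the model): reshaping a tensor with zero elements using -1 is ambiguous, which is exactly what happens when there are no proposals.

import torch

rel_codes = torch.zeros(0, 4)   # roughly what rel_codes looks like when there are no proposals
rel_codes.reshape(0, -1)        # RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...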

mattans commented 5 years ago

This is a repr of the image, if that helps:

tensor([[[1.0000, 1.0000, 1.0000,  ..., 0.1882, 0.1647, 0.1451],
         [1.0000, 1.0000, 1.0000,  ..., 0.1725, 0.1451, 0.1412],
         [1.0000, 1.0000, 1.0000,  ..., 0.1373, 0.1255, 0.1255],
         ...,
         [0.3647, 0.3608, 0.3608,  ..., 0.3059, 0.3098, 0.3137],
         [0.3608, 0.3608, 0.3608,  ..., 0.2980, 0.3059, 0.3098],
         [0.3608, 0.3725, 0.3765,  ..., 0.2902, 0.3020, 0.3137]],

        [[1.0000, 1.0000, 1.0000,  ..., 0.1882, 0.1686, 0.1529],
         [1.0000, 1.0000, 1.0000,  ..., 0.1725, 0.1490, 0.1490],
         [1.0000, 1.0000, 1.0000,  ..., 0.1490, 0.1412, 0.1412],
         ...,
         [0.3725, 0.3686, 0.3725,  ..., 0.3216, 0.3333, 0.3333],
         [0.3725, 0.3725, 0.3686,  ..., 0.3176, 0.3255, 0.3216],
         [0.3725, 0.3843, 0.3843,  ..., 0.3137, 0.3216, 0.3255]],

        [[1.0000, 1.0000, 1.0000,  ..., 0.2196, 0.1765, 0.1412],
         [1.0000, 1.0000, 1.0000,  ..., 0.2039, 0.1569, 0.1451],
         [1.0000, 1.0000, 1.0000,  ..., 0.1765, 0.1451, 0.1529],
         ...,
         [0.5216, 0.5176, 0.5098,  ..., 0.4275, 0.4353, 0.4471],
         [0.5098, 0.5098, 0.5176,  ..., 0.4353, 0.4392, 0.4431],
         [0.4941, 0.5216, 0.5373,  ..., 0.4392, 0.4471, 0.4549]]],
       device='cuda:0')

fmassa commented 5 years ago

Hi,

This is weird, as it indicates that the RPN doesn't produce any proposals (which I'm not sure should happen).

Do you have a repro that I can use?

alexarmbr commented 4 years ago

I have encountered this issue too. https://github.com/pytorch/vision/blob/d2c763e14efe57e4bf3ebf916ec243ce8ce3315c/torchvision/models/detection/generalized_rcnn.py#L66-L72 Specifically, during evaluation of the model, for some training examples the call to self.backbone on line 67 (which in my case is an FPN) returns a feature pyramid that is all NaNs; this seems to be what is causing the problem.
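
For reference, this is roughly how I check the FPN output for NaNs; treat it as a sketch, where model is the detection model and images is an already-transformed batch tensor (both placeholders here):

import torch

with torch.no_grad():
    features = model.backbone(images)           # OrderedDict of feature maps for an FPN backbone
    for name, fmap in features.items():
        if torch.isnan(fmap).any():
            print(f"feature map '{name}' contains NaNs")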

fmassa commented 4 years ago

@arm-buster if your FPN returns all NaNs, this means that your model or your input is corrupted. My guess is that your input data is corrupted, because if it were a problem with the model, training would have aborted before reaching this error. Can you double-check that all your inputs are well-formed?
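
A quick sanity check over the data might look like this sketch, where data_loader is a placeholder for your train/eval loader yielding (images, targets):

import torch

for batch_idx, (images, targets) in enumerate(data_loader):
    for img in images:
        if not torch.isfinite(img).all():
            print(f"batch {batch_idx} contains a NaN/Inf image")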

Liyw979 commented 4 years ago

I ran into the same error:

File "generalized_rcnn.py", line 64, in forward
targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "roi_heads.py", line 329, in forward
image_shapes)
File "roi_heads.py", line 225, in postprocess_detections
pred_boxes = self.box_coder.decode(box_regression, proposals)
File "_utils.py", line 189, in decode
rel_codes.reshape(box_sum, -1), concat_boxes
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
Liyw979 commented 4 years ago

@fmassa I have a question. What do you mean by "corrupted"? I manually set some parts of my image to 0; is this what you mean by "corrupted"? I also noticed that in @mattans's reply there are areas of 1. Many thanks.

fmassa commented 4 years ago

@liyiwei979621500 by corrupted I mean that some elements of your input are NaN. That's my current best guess.

Liyw979 commented 4 years ago

@fmassa I have solved my problem. I guess it was caused by a broken model pkl file.
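
For anyone else hitting this, a quick sanity check on a saved checkpoint could look like this sketch (checkpoint.pkl is a placeholder path):

import torch

# Scan every tensor in the checkpoint for NaN/Inf before loading it into the model.
state_dict = torch.load("checkpoint.pkl", map_location="cpu")
bad_keys = [k for k, v in state_dict.items()
            if torch.is_tensor(v) and not torch.isfinite(v).all()]
print("tensors with NaN/Inf:", bad_keys)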

fmassa commented 4 years ago

@liyiwei979621500 thanks for the information.

I'm closing this issue, as I suppose it's due to some corrupted data (either input or weights). @mattans, if you manage to provide a minimal working example that reproduces the error, please let us know.

jhultman commented 4 years ago

Hi @fmassa, here's a minimal working example. Obviously this initialization is purposely poor, but it would be nice if the inference code didn't crash. Note that none of the weights or inputs are NaN, so this could in principle happen by chance.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
torch.manual_seed(21)

model = fasterrcnn_resnet50_fpn(num_classes=2).cuda().eval()
# Replace every weight with random noise, so the RPN is unlikely to produce any valid proposals.
state_dict = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
model.load_state_dict(state_dict)
with torch.no_grad():
    model([torch.rand((3, 512, 512)).cuda()])

>>> RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...

The problem is in BoxCoder.decode. Here's my attempt at a fix, which seems to work for me (unchanged code omitted):

def decode(self, rel_codes, boxes):
    ...
    assert rel_codes.size(0) == box_sum
    pred_boxes = self.decode_single(rel_codes, concat_boxes)
    # Compute the second dimension explicitly instead of using -1, so the
    # reshape is well-defined even when box_sum == 0.
    deltas_per_box = rel_codes.size(-1) // 4
    return pred_boxes.reshape(box_sum, deltas_per_box, 4)

I can submit a PR if you think it's appropriate.

fmassa commented 4 years ago

@jhultman if you could send a PR (with a test case), that would be great! The test case just needs to cover BoxCoder.encode and BoxCoder.decode; there's no need to test the whole model.
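
Something along these lines would be enough; this is a rough sketch that assumes a decode fix like the one proposed above is in place:

import torch
from torchvision.models.detection._utils import BoxCoder

def test_box_coder_decode_empty():
    # Decoding with zero proposals should return an empty tensor instead of raising.
    box_coder = BoxCoder(weights=(10.0, 10.0, 5.0, 5.0))
    rel_codes = torch.zeros(0, 4)    # no predictions
    boxes = [torch.zeros(0, 4)]      # one image, zero proposals
    pred_boxes = box_coder.decode(rel_codes, boxes)
    assert pred_boxes.shape == (0, 1, 4)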

Alieladi commented 4 years ago

Sometimes it is an exploding gradient problem, where the model outputs very large values (> 10**20) that end up as NaN. In that case you must retrain your model from the beginning with a lower learning rate or gradient clipping, e.g.:

loss.backward()
# Clip the gradient norm before the optimizer step to keep updates bounded.
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()

helloshuangzi commented 4 years ago

I ran into the same problem with torchvision 0.5.0. I was wondering if there has been any progress on the fix? @fmassa @jhultman

sdw95927 commented 4 years ago

I have encountered this issue too. https://github.com/pytorch/vision/blob/d2c763e14efe57e4bf3ebf916ec243ce8ce3315c/torchvision/models/detection/generalized_rcnn.py#L66-L72

Specifically, during evaluation of the model, for some training examples the call to self.backbone on line 67 (which in my case is an FPN) returns a feature pyramid that is all NaNs; this seems to be what is causing the problem.

Thanks for pointing out where the problem is! I decreased my learning rate from 0.002 to 0.0002, and that seems to have fixed my problem.

tooHotSpot commented 4 years ago

If you encounter exploding gradients even with lower learning rates, give Adagrad a try with a non-standard learning rate of 0.001. In general, training that way may not be ideal, since the RPN is usually trained with a higher learning rate than the box/classifier heads.
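
A rough sketch of what I mean by separate learning rates; the exact values here are only illustrative:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=2)
# One optimizer, but each submodule gets its own learning rate via parameter groups.
optimizer = torch.optim.Adagrad([
    {"params": model.backbone.parameters(), "lr": 0.001},
    {"params": model.rpn.parameters(), "lr": 0.005},        # higher lr for the RPN
    {"params": model.roi_heads.parameters(), "lr": 0.001},  # box/mask heads
])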

augustoolucas commented 3 years ago

Is there a definitive solution to this problem? It's happening for me when using FasterRCNN too.