Changing backbone's architecture (VGG16, InceptionV3 and ResNeXt) results in NaN losses

sbugallo commented 6 years ago

Hi,

I've been trying to replace the ResNet 101 used as backbone with other architectures (e.g. VGG16, Inception V3, ResNeXt 101 or Inception ResNet V2) to check whether the results improve.

The problem is that, whenever I substitute the ResNet with any other architecture, the training losses of the head branches (class, bbox and mask) are NaN or zero:

loss: nan - rpn_class_loss: 0.6948 - rpn_bbox_loss: 0.3827 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.6931 - val_rpn_bbox_loss: 0.2744 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00

These are the implementations I have been using:

* Inception ResNet V2: https://gist.github.com/BugaDM/09b1b76d04a570102c966a31f7d37198
* Inception V3: https://gist.github.com/BugaDM/aabd048bcb1d7ab26ece4c3499f826e0
* ResNeXt 101: https://gist.github.com/BugaDM/f6e3174953f93346b6002d9adc7eb3e5
* VGG16: https://gist.github.com/BugaDM/cb70226bed33c0de49b289a8fbd4b667

I do not get any errors during execution. Any suggestions?

waleedka commented 6 years ago

The NaN might be due to having a very large loss value that causes an overflow. But what I'm more concerned about are the 0 values in your bounding box and mask losses. Losses shouldn't be zero, so I think this is not related to the choice of backbone, but rather you have a bug in your code somewhere.
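To see at a glance which terms are NaN and which are exactly zero, the Keras progress line can be parsed and filtered. This is only a rough sketch: `parse_keras_losses` and `suspicious_losses` are hypothetical helper names, not part of Mask_RCNN or Keras, and it assumes the log line uses the ` - ` separated `name: value` layout shown in the first post.

```python
import math


def parse_keras_losses(line):
    """Split a Keras progress line into a {loss_name: float} dict."""
    losses = {}
    for field in line.split(" - "):
        name, _, value = field.partition(":")
        # float() accepts "nan" and scientific notation such as "0.0000e+00"
        losses[name.strip()] = float(value)
    return losses


def suspicious_losses(losses):
    """Return only the entries that are NaN or exactly zero."""
    return {
        name: value
        for name, value in losses.items()
        if math.isnan(value) or value == 0.0
    }


# Shortened copy of the training part of the log line from the first post
line = (
    "loss: nan - rpn_class_loss: 0.6948 - rpn_bbox_loss: 0.3827 - "
    "mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - "
    "mrcnn_mask_loss: 0.0000e+00"
)
print(sorted(suspicious_losses(parse_keras_losses(line))))
```

On that line the RPN losses pass the filter while every `mrcnn_*` term is flagged, which suggests looking at what the heads receive as input rather than at the RPN.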

maksimovkonstantin commented 6 years ago

@waleedka Does that mean NaN losses are generally caused by incorrectly prepared data?

paulcx commented 6 years ago

@waleedka I tried to use Inception ResNet V2 as the backbone, but it raises an error when merging because the shape of C4 does not match. I'm wondering whether Inception ResNet V2 can be imported without modifying its architecture. If not, @BugaDM, how do you deal with that?

sbugallo commented 6 years ago

@paulcx Which layers are you using as endpoints? You are probably picking the wrong layer. Check that your C1, C2, C3, C4 and C5 are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input size.

paulcx commented 6 years ago

@BugaDM I used the same endpoints you provided above (https://gist.github.com/BugaDM/09b1b76d04a570102c966a31f7d37198). With those, C4 comes out with a 17 x 17 x 1088 shape?
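A likely reason an odd size such as 17 x 17 fails to merge: in this repository's FPN top-down path, the coarser level is upsampled by 2 and added to the next finer level, so each level has to be exactly twice the size of the one above it. The sketch below checks that condition on plain (height, width) tuples. `fpn_merge_mismatches` is a hypothetical helper, and the 8 x 8 size used for C5 is an assumed value for illustration; only the 17 x 17 C4 comes from this thread.

```python
def fpn_merge_mismatches(sizes):
    """Find adjacent levels where 2x upsampling of the coarser map
    does not reproduce the size of the finer map.

    sizes: dict of level name -> (height, width), ordered fine to coarse.
    Returns a list of (coarse, fine, upsampled_size, fine_size) tuples.
    """
    names = list(sizes)
    problems = []
    for fine, coarse in zip(names, names[1:]):
        coarse_h, coarse_w = sizes[coarse]
        upsampled = (2 * coarse_h, 2 * coarse_w)
        if upsampled != tuple(sizes[fine]):
            problems.append((coarse, fine, upsampled, tuple(sizes[fine])))
    return problems


# C4 size is from the thread; the C5 size is an assumption for illustration
print(fpn_merge_mismatches({"C4": (17, 17), "C5": (8, 8)}))
# → [('C5', 'C4', (16, 16), (17, 17))]
```

An empty list means every pair of levels can be added element-wise after upsampling. A non-empty list points at the level whose endpoint (or padding) needs to change.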

sbugallo commented 6 years ago

@paulcx I'm using a 1024x1024 input, and my endpoints are:

C1 = Tensor("block1_pool/MaxPool:0", shape=(?, 512, 512, 64), dtype=float32)
C2 = Tensor("block2_pool/MaxPool:0", shape=(?, 256, 256, 128), dtype=float32)
C3 = Tensor("block3_pool/MaxPool:0", shape=(?, 128, 128, 256), dtype=float32)
C4 = Tensor("block4_pool/MaxPool:0", shape=(?, 64, 64, 512), dtype=float32)
C5 = Tensor("block5_pool/MaxPool:0", shape=(?, 32, 32, 512), dtype=float32)

As you can see, they are /2, /4, /8, /16, /32 of the input size.
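The same check can be done mechanically instead of by eye. A minimal sketch, assuming a square 1024 x 1024 input (which is what the 512 x 512 C1 shape implies); `strides_from_shapes` is a hypothetical helper, not a Mask_RCNN function, and the spatial sizes are copied from the tensors listed above.

```python
def strides_from_shapes(input_hw, shapes):
    """Compute the (vertical, horizontal) stride of each endpoint.

    input_hw: (height, width) of the network input.
    shapes:   dict of endpoint name -> (height, width) of its feature map.
    Raises ValueError if a feature map does not evenly divide the input.
    """
    in_h, in_w = input_hw
    strides = {}
    for name, (h, w) in shapes.items():
        if in_h % h or in_w % w:
            raise ValueError(
                f"{name}: {h}x{w} does not evenly divide {in_h}x{in_w}"
            )
        strides[name] = (in_h // h, in_w // w)
    return strides


# Spatial sizes of the VGG16 pooling endpoints listed above
vgg16_shapes = {
    "C1": (512, 512),
    "C2": (256, 256),
    "C3": (128, 128),
    "C4": (64, 64),
    "C5": (32, 32),
}
print(strides_from_shapes((1024, 1024), vgg16_shapes))
# → {'C1': (2, 2), 'C2': (4, 4), 'C3': (8, 8), 'C4': (16, 16), 'C5': (32, 32)}
```

For comparison, the default config in this repository lists `BACKBONE_STRIDES = [4, 8, 16, 32, 64]` for P2 to P6, so strides of 4, 8, 16 and 32 on C2 to C5 line up with what the FPN and anchor generation expect.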

paulcx commented 6 years ago

@BugaDM Are these endpoints for Inception ResNet V2? The endpoints in https://gist.github.com/BugaDM/09b1b76d04a570102c966a31f7d37198 do not match these tensor shapes.

paulcx commented 6 years ago

@waleedka Hi waleedka, do you know which endpoints are correct for Inception ResNet V2, or of any alternative solution?

John1231983 commented 6 years ago

@BugaDM you are wrong. It must be C1 = C2 = /2, C3 = /4, C4 = /8 and C5 = /16. You can print the shapes in ResNet50 as an example.

arivle commented 2 years ago

I need the VGG16 implementation. Would you mind sharing it again? The link shows me a "page not found" message. Thanks in advance.

matterport / Mask_RCNN

Changing backbone's architecture (VGG16, InceptionV3 and ResNeXt) results in NaN losses #213