matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Inception-ResNet-V2 training issue #168

Open vladpaunescu opened 6 years ago

vladpaunescu commented 6 years ago

Hi,

Thank you for making Mask R-CNN public on GitHub. It is really amazing work. I tried to replace the ResNet-101 encoder with the Inception-ResNet-V2 encoder from Keras. Unfortunately, I didn't get better results.

These are the endpoints I use to build the feature pyramid. They correspond to the different scales.

    # scale /2 is not used
    C1 = None

    # scale /4   Conv2d_4a_3x3  InceptionResnetV2/Conv2d_4a_3x3/Relu:0 (1, 256, 256, 192)
    C2 = end_points['Conv2d_4a_3x3']

    # scale /8   Mixed_5b       InceptionResnetV2/Mixed_5b/concat:0 (1, 128, 128, 320)
    C3 = end_points['Mixed_5b']

    # scale /16  PreAuxLogits   InceptionResnetV2/Repeat_1/block17_20/Relu:0 (1, 64, 64, 1088)
    C4 = end_points['PreAuxLogits']

    # scale /32  Conv2d_7b_1x1  InceptionResnetV2/Conv2d_7b_1x1/Relu:0 (1, 32, 32, 1536)
    C5 = end_points['Conv2d_7b_1x1']

    return [C1, C2, C3, C4, C5]
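
For context, here is a sketch of how these stages feed the FPN top-down path, abbreviated from the repo's model.py (note that C1 is unused there as well; each level is upsampled 2x before being merged with the next finer stage, which is why the stages must halve in scale from one to the next):

    P5 = KL.Conv2D(256, (1, 1), name='fpn_c5p5')(C5)
    P4 = KL.Add(name='fpn_p4add')([
        KL.UpSampling2D(size=(2, 2), name='fpn_p5upsampled')(P5),
        KL.Conv2D(256, (1, 1), name='fpn_c4p4')(C4)])
    P3 = KL.Add(name='fpn_p3add')([
        KL.UpSampling2D(size=(2, 2), name='fpn_p4upsampled')(P4),
        KL.Conv2D(256, (1, 1), name='fpn_c3p3')(C3)])
    P2 = KL.Add(name='fpn_p2add')([
        KL.UpSampling2D(size=(2, 2), name='fpn_p3upsampled')(P3),
        KL.Conv2D(256, (1, 1), name='fpn_c2p2')(C2)])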

I'm training Inception-ResNet-V2 on the COCO dataset (train + valminusminival) with one GPU, one image per GPU, and 2000 steps per epoch. The initial learning rate is 0.006. The training strategy is end-to-end first, then fine-tuning the heads.

        # Training - Stage 1
        print("Train all layers")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=80,
                    layers='all')

        # Training - Stage 2
        # Train all layers
        print("Train all layers")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE / 10,
                    epochs=160,
                    layers='all')

        # Training - Stage 3
        # Train all layers
        print("Train all layers")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE / 100,
                    epochs=240,
                    layers='all')

        # Training - Stage 4
        print("Fine-tune network heads")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE / 100,
                    epochs=320,
                    layers='heads')

Unfortunately, the loss doesn't decrease as much as the provided model's: my train loss is 0.84 and val loss 0.2, and the bbox results are lower. The provided model reaches a train loss of 0.7 when fine-tuning.

Evaluating checkpoint 217. Path: /home/vlad/git/obstacle-detection/instance-od/mask-rcnn-up/logs/inception/coco20171220T2121/mask_rcnn_coco_0217.h5
Loading and preparing results...
DONE (t=0.01s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=2.01s).
Accumulating evaluation results...
DONE (t=0.59s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.287
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.454
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.313
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.153
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.426
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.256
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.350
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.176
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.509
Prediction time: 114.29512548446655. Average 0.2285902509689331/image
Total time:  124.09156823158264

I have some questions:

  1. I noticed you use ResNet101 encoder and download ResNet50 weights. How do you apply the weights from ResNet50 to ResNet101?

  2. I'm trying to reproduce the results from the repository using ResNet101. I currently have:

    • 1 GPU, 2 images/GPU, STEPS_PER_EPOCH = 1000.

The training strategy is default (as in given example):

       # *** This training schedule is an example. Update to your needs ***

        # Training - Stage 1
        print("Training network heads")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=40,
                    layers='heads')

        # Training - Stage 2
        # Finetune layers from ResNet stage 4 and up
        print("Fine tune Resnet stage 4 and up")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=120,
                    layers='4+')

        # Training - Stage 3
        # Fine tune all layers
        print("Fine tune all layers")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE / 10,
                    epochs=160,
                    layers='all')

Is there anything I need to add to reproduce the results? I'm downloading ImageNet ResNet50 weights for the ResNet101 encoder.

  3. How can I train using the Inception-ResNet-V2 encoder? I also tried using atrous convolution in the RPN when building the FPN. I managed to decrease the training loss under 1.00, but it's not enough. Why do you first train the heads (freezing the encoder) and then fine-tune the encoder? In my experiments, I first trained end-to-end and then fine-tuned the heads. Am I missing something? Do you do any data augmentation besides the random horizontal flips in the load_image_gt method? The atrous RPN change is shown below.
        # atrous RPN
        rate = (1, 1)
        if config.ATROUS:
            rate = (2, 2)
        # Attach 3x3 conv to all P layers to get the final feature maps.
        P2 = KL.Conv2D(256, (3, 3), dilation_rate=rate, padding="SAME", name="fpn_p2")(P2)
        P3 = KL.Conv2D(256, (3, 3), dilation_rate=rate, padding="SAME", name="fpn_p3")(P3)
        P4 = KL.Conv2D(256, (3, 3), dilation_rate=rate, padding="SAME", name="fpn_p4")(P4)
        P5 = KL.Conv2D(256, (3, 3), dilation_rate=rate, padding="SAME", name="fpn_p5")(P5)

I'm pasting here the loss curves from TensorBoard when using Inception-ResNet-V2. At the last stage of training, when only the heads are trained with the learning rate divided by 100, the loss seems to jump up instead of decreasing. Maybe it's because I used the same learning rate as in the previous stage. For all other stages, when the learning rate is decreased 10 times, the loss decreases.

[TensorBoard loss screenshot from 2018-01-03 16-51-42]

Thank you, Vlad

waleedka commented 6 years ago

I noticed you use ResNet101 encoder and download ResNet50 weights. How do you apply the weights from ResNet50 to ResNet101?

Yes, this was discussed in another thread as well. I used ResNet50 weights because those were the ones available from Keras. And since we're providing COCO-trained weights, I figured it's better to start with the COCO weights anyway for most cases. If I get some free time, I might try to find ResNet101 weights and use them instead; it might improve performance a bit. If you're interested in implementing this and submitting a pull request, I'd be happy to review it.
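
For reference, partial weight transfer in Keras works by layer-name matching, which is why ResNet50 weights can be loaded into a deeper graph at all. The file path below is illustrative:

    # Layers whose names match get weights from the file; everything else
    # keeps its random initialization, so a deeper backbone loads partially.
    model.load_weights("resnet50_imagenet_weights.h5", by_name=True)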

I'm trying to reproduce the results from the repository using ResNet101

I think you'll need to use a bigger batch size, hence the 8-GPU training. At some point I had the model training on 1 GPU, then switched to 8 GPUs and continued training; I noticed a good improvement from that switch. Also, I think I trained it longer than the example schedule above. I don't remember the details, but that's why I put a note stating that this schedule is just an example.
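
For reference, a sketch of what that setup looks like with this repo's config (the subclass name is illustrative):

    class EightGPUConfig(CocoConfig):
        # Effective batch size = GPU_COUNT * IMAGES_PER_GPU = 16
        GPU_COUNT = 8
        IMAGES_PER_GPU = 2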

If you'd rather only train on 1 GPU, an alternative is to batch the updates from every 8 steps and average them before applying them to the weights. But that would require changes to the Keras optimizer. Not too complex, but not too simple either.
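
A rough sketch of that idea, assuming Keras 2.x with the TensorFlow 1.x backend. The AccumulatingSGD name and accum_steps argument are illustrative, not part of Keras or this repo, and the sketch ignores learning-rate decay and weight constraints for brevity:

    import keras.backend as K
    from keras.optimizers import Optimizer

    class AccumulatingSGD(Optimizer):
        """SGD with momentum that sums gradients over accum_steps batches
        and applies their average once, emulating a larger batch size."""

        def __init__(self, lr=0.001, momentum=0.9, accum_steps=8, **kwargs):
            super(AccumulatingSGD, self).__init__(**kwargs)
            with K.name_scope(self.__class__.__name__):
                self.iterations = K.variable(0, dtype='int64', name='iterations')
                self.lr = K.variable(lr, name='lr')
                self.momentum = K.variable(momentum, name='momentum')
            self.accum_steps = accum_steps

        def get_updates(self, loss, params):
            grads = self.get_gradients(loss, params)
            accums = [K.zeros(K.int_shape(p)) for p in params]
            moments = [K.zeros(K.int_shape(p)) for p in params]
            # 1.0 on the step that completes a virtual batch, else 0.0.
            apply_now = K.cast(K.equal(self.iterations % self.accum_steps,
                                       self.accum_steps - 1), K.floatx())
            self.updates = [K.update_add(self.iterations, 1)]
            for p, g, a, m in zip(params, grads, accums, moments):
                avg_grad = (a + g) / float(self.accum_steps)
                new_m = self.momentum * m - self.lr * avg_grad
                # Step the momentum and weights only when the virtual batch
                # is full; otherwise just add this gradient to the buffer.
                self.updates.append(K.update(m, apply_now * new_m +
                                                (1. - apply_now) * m))
                self.updates.append(K.update_add(p, apply_now * new_m))
                self.updates.append(K.update(a, (1. - apply_now) * (a + g)))
            return self.updates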

Why do you first train the heads (freeze the encoder) and then fine tune the encoder?

When you start, the backbone has good weights trained on ImageNet, but the heads have random weights. If you train all layers, you end up updating the backbone weights using gradients computed from the random weights in the heads. This causes unnecessary changes to the backbone weights. Training the heads only ensures that we don't touch the good backbone weights until the heads have had a bit of time to settle.

Another approach to handle this situation is to do a warm-up phase: train all layers, but with a much smaller learning rate (say, divided by 100), and then, after things settle, switch to your original learning rate.
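
With this repo's train() API, that could look like the following. The warm-up length is illustrative (note that epochs is a cumulative target, so the second call continues from where the first stopped):

    # Warm-up: all layers at 1/100 of the learning rate.
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE / 100,
                epochs=2,          # illustrative warm-up length
                layers='all')
    # Then continue at the full learning rate.
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=40,
                layers='all')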

Do you do any data augmentation, besides the random horizotnal flips from load_image_gt method?

No

when only the heads are training with learning rate/100, the loss seems to jump up

Hmm. Hard to guess. Based on your graphs, it looks like the training loss is still decreasing but the validation loss goes up. That usually suggests over-fitting, but there could also be something else.

jmtatsch commented 6 years ago

I already adapted your code to load 101-layer ResNet weights and left it training over the holidays. When I get some acceptable results, I will make a pull request. Maybe you can retrain on 8 GPUs then.

jmtatsch commented 6 years ago

Concerning larger batch sizes on limited VRAM, maybe we can revive the following Keras issue: https://github.com/keras-team/keras/issues/5244

vladpaunescu commented 6 years ago

Hello again! I trained with ResNet101 in order to reproduce the official results.

My training protocol is the default one given in the example:


        # Training - Stage 1
        print("Training network heads")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=40,
                    layers='heads')

        # Training - Stage 2
        # Finetune layers from ResNet stage 4 and up
        print("Fine tune Resnet stage 4 and up")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=120,
                    layers='4+')

        # Training - Stage 3
        # Fine tune all layers
        print("Fine tune all layers")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE / 10,
                    epochs=160,
                    layers='all')

Other hyperparameters are at their defaults. GPU count is 1 and images/GPU is 2:

    # NUMBER OF GPUs to use. For CPU training, use 1
    GPU_COUNT = 1

    # Number of images to train with on each GPU. A 12GB GPU can typically
    # handle 2 images of 1024x1024px.
    # Adjust based on your GPU memory and image sizes. Use the highest
    # number that your GPU can handle for best performance.
    IMAGES_PER_GPU = 2

    # Number of training steps per epoch
    # This doesn't need to match the size of the training set. Tensorboard
    # updates are saved at the end of each epoch, so setting this to a
    # smaller number means getting more frequent TensorBoard updates.
    # Validation stats are also calculated at each epoch end and they
    # might take a while, so don't set this too small to avoid spending
    # a lot of time on validation stats.
    STEPS_PER_EPOCH = 1000

    # Learning rate and momentum
    # The Mask RCNN paper uses lr=0.02, but on TensorFlow it causes
    # weights to explode. Likely due to differences in optimizer
    # implementation.
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
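
For reference, the effective batch size the repo derives from these settings (computed in Config.__init__ in config.py):

    BATCH_SIZE = IMAGES_PER_GPU * GPU_COUNT   # = 2 with the values above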

Unfortunately, the results are well below the official release. My best checkpoint has:


Epoch 150 (actually epoch 151, since counting starts from 0):

Evaluate annotation type *bbox*
DONE (t=2.54s).
Accumulating evaluation results...
DONE (t=0.76s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.224
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.426
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.211
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.114
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.270
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.346
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.209
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.300
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.306
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.141
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.345
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.456

Besides that, I had a bug when evaluating the Inception-ResNet-V2 model (trained with ImageNet mean subtraction, evaluated without it). After the fix, the best accuracy is:

Epoch 217:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.287
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.454
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.313
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.153
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.426
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.256
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.350
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.176
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.509
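
For context, the mean-subtraction mismatch above comes down to applying the same per-channel mean at training and evaluation time. In this codebase, mold_image() handles that using config.MEAN_PIXEL; a minimal sketch with the repo's default values:

    import numpy as np

    # Repo default; a backbone pretrained with different preprocessing
    # (e.g. Inception-style scaling) would need matching values.
    MEAN_PIXEL = np.array([123.7, 116.8, 103.9])

    def mold_image(images):
        """Subtract the mean pixel and cast to float. Must be applied
        identically at training and evaluation time."""
        return images.astype(np.float32) - MEAN_PIXEL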

Even though Inception-ResNet-V2 outperforms ResNet101, both are well below the reported results. That might be caused by:

  1. Different training protocols
  2. Small batch size

If you have better results or any ideas on how to improve the experiments (especially using the Inception-ResNet-V2 backbone), please post them here.

@waleedka Thank you for your detailed explanation, and for reopening the virtual batch size issue in Keras.

Vlad

waleedka commented 6 years ago

@vladpaunescu Out of curiosity, are you using Python 3 or 2.7?

Pelups commented 6 years ago

@vladpaunescu, your results seem quite good given that you only trained on 1 GPU, no? You trained for 160 epochs x 1000 steps/epoch x 2 images/GPU x 1 GPU = 320,000 images. The official results were obtained with 160,000 steps x 2 images/GPU x 8 GPUs = 2,560,000 images.

Correct me if I'm wrong.

BTW, I'm really interested in a ResNet101 implementation like yours.

John1231983 commented 6 years ago

@jmtatsch: Have you completed training with the official ResNet-101 pretrained model? Could you share your results now?

matiqul commented 6 years ago

Are you using the model linked below? https://github.com/tensorflow/models/blob/master/research/slim/nets/inception_resnet_v2.py

matiqul commented 6 years ago

@vladpaunescu Can you tell me how I can draw the mrcnn_bbox_loss, mrcnn_class_loss, and rpn_bbox_loss plots like yours?

John1231983 commented 6 years ago

@vladpaunescu: I think your scales are wrong. Taking ResNet50 as an example, it should be C1 = C2 = /2, C3 = /4, C4 = /8, and C5 = /16. Am I right?

AloshkaD commented 6 years ago

@matiqul go to the directory where Mask_RCNN is and run

    tensorboard --logdir=logs

then open http://localhost:6006 in a browser and you should get all the plots you need from TensorBoard.

matiqul commented 6 years ago

@AloshkaD thanks, it works!

enoceanwei commented 5 years ago

@vladpaunescu

Hi, I am interested in your work. I also want to change the backbone CNN structure from ResNet to Inception-ResNet, similar to what you did. Could you share more details on how to do that? Many thanks.

Kind Regards

Wei

babanthierry94 commented 1 year ago

@vladpaunescu @enoceanwei Hi, I hope you succeeded at this task. I'm trying to do the same thing. Could you help me with some advice?