matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Suffer from overfitting #281

Open keven4ever opened 6 years ago

keven4ever commented 6 years ago

Hello,

I only have a small training set of about 670 labelled images and would like to further improve accuracy by training the entire backbone network instead of only the heads. However, after about 30-40 epochs the network already starts to overfit. ResNet already uses batch norm, so I wonder whether there is something else I can do to improve the situation. How about dropout? If I apply dropout, can I still load the pre-trained ResNet weights from COCO or ImageNet? Or is there some other technique? Thank you!

tonyzhao6 commented 6 years ago

@keven4ever

With such a small dataset, it is unlikely that BN or dropout will help. Also, combining BN with dropout is probably not a good idea (see the BN paper), and I don't think you can apply dropout with the pre-trained ResNet weights since that model wasn't trained with dropout in the first place.

The model capacity of ResNet-101 might be too large for your dataset. While it's true that ResNet enables deeper networks to converge compared to their plain counterparts, there is still a limit on the number of layers that can be incorporated into a ResNet before convergence suffers. For example, Table 6 in the ResNet paper shows that the classification error on CIFAR-10 decreases with increasing depth up to ResNet-110, but ResNet-1202 actually performs worse than ResNet-32.

To prevent overfitting, you can try:

1) Getting a larger dataset (but this is probably not feasible, otherwise you would've done this already)
2) Stronger weight decay, i.e., L2 regularization (see the sketch below)
3) Lower model capacity (e.g., ResNet-50 or even ResNet-32)
4) k-fold cross-validation
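
For point 2, this is a rough sketch of what stronger weight decay could look like in this repo. WEIGHT_DECAY is an attribute of the Config class; the subclass name and the value below are only examples, not something I have tuned on your data.

from config import Config  # newer checkouts of the repo: from mrcnn.config import Config

class NucleiConfig(Config):
    # Example project config: only the name, class count, and regularization
    # strength are overridden here.
    NAME = "nuclei"
    NUM_CLASSES = 1 + 1       # background + nucleus
    WEIGHT_DECAY = 0.001      # 10x the default of 0.0001; tune against val_loss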

maksimovkonstantin commented 6 years ago

@keven4ever do you use augmentations? In DS Bowl 2018 it is critical. I had the same problem, and augmentations helped me a lot.

John1231983 commented 6 years ago

@maksimovkonstantin: Thanks. What kind of augmentation techniques do you use? How much gain did you achieve? I see that your score is 0.437. What is the score without augmentation?

keven4ever commented 6 years ago

@maksimovkonstantin very good question! Actually I tried augmentation (without training the full backbone), which only improved the training loss but not val_loss. I also tried training the full backbone with the default augmentation (flip l/r), which suffered from overfitting. As a next step I will try to combine both. Btw, what kind of augmentation did you apply? Flip l/r, flip u/d, rotate 90?

keven4ever commented 6 years ago

@FruVirus thanks for the tips, I also intend to try a shallower model like ResNet-50. I saw that model.py's resnet_graph function supports both resnet50 and resnet101, so just changing the architecture argument to resnet50 should be sufficient, right?
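
In other words, something like this inside MaskRCNN.build() in model.py (just a sketch; the exact call site and arguments may differ between versions of the repo):

# The backbone is built by resnet_graph(); swapping the architecture string
# from "resnet101" to "resnet50" is assumed to be the only change needed.
_, C2, C3, C4, C5 = resnet_graph(input_image, "resnet50", stage5=True)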

tonyzhao6 commented 6 years ago

@keven4ever , yes I believe so. I'd be interested to hear if this helps with your dataset.

keven4ever commented 6 years ago

@FruVirus sure, will keep you updated! Btw, is there an easy way to load COCO pre-trained weights for a ResNet-50 FPN?

maksimovkonstantin commented 6 years ago

@John1231983 score without aug is 0.413

keven4ever commented 6 years ago
[screenshot: training/validation loss curves]

@maksimovkonstantin I tried some augmentation before image resizing, including flip l/r, flip u/d and rotate 90 degrees, with ResNet-101. As you can see, it again starts to overfit. What kind of aug did you apply? Are you using ResNet-101 or 50?

maksimovkonstantin commented 6 years ago

@keven4ever I use the default ResNet-101, and I also rotate by a random custom angle. Here is my aug function:

import random

import cv2
import numpy as np


def data_augmentation(input_images,
                      h_flip=True,
                      v_flip=True,
                      rotation=360,
                      zoom=1.5,
                      brightness=0.5,
                      crop=False):
    # The first element of input_images is the image; all remaining elements
    # are its masks. Every spatial transform below is applied to the whole list.
    output_images = input_images.copy()
    if crop and random.randint(0, 1):
        # random crop (locs_for_random_crop is a separate user-defined helper)
        h, w, c = output_images[0].shape
        upper_h, new_h, upper_w, new_w = locs_for_random_crop(h, w)
        output_images = [input_image[upper_h:upper_h + new_h, upper_w:upper_w + new_w, :] for input_image in output_images]

    # random flip
    if h_flip and random.randint(0, 1):
        output_images = [cv2.flip(input_image, 1) for input_image in output_images]
    if v_flip and random.randint(0, 1):
        output_images = [cv2.flip(input_image, 0) for input_image in output_images]

    # random gamma/brightness adjustment, applied only to the image
    # (output_images[0]), not to the masks
    factor = 1.0 + abs(random.gauss(mu=0.0, sigma=brightness))
    if random.randint(0, 1):
        factor = 1.0 / factor
    table = np.array([((i / 255.0) ** factor) * 255 for i in np.arange(0, 256)]).astype(np.uint8)
    output_images[0] = cv2.LUT(output_images[0], table)
    if rotation:
        angle = random.randint(0, rotation)
    else:
        angle = 0.0
    if zoom:
        scale = random.randint(50, zoom * 100) / 100
    else:
        scale = 1.0
    # print(angle, scale)
    # the same rotation/zoom matrix M is applied to the image and every mask
    # so that they stay aligned
    if rotation or zoom:
        for i, input_image in enumerate(output_images):
            M = cv2.getRotationMatrix2D((input_image.shape[1] // 2, input_image.shape[0] // 2), angle, scale)
            # M = cv2.getRotationMatrix2D((input_image.shape[1] // 2, input_image.shape[0] // 2), 45, 1)
            output_images[i] = cv2.warpAffine(input_image, M, (input_image.shape[1], input_image.shape[0]))
    # print('len of output %s' % len(output_images))
    return [input_image.astype(np.uint8) for input_image in output_images]
keven4ever commented 6 years ago

@maksimovkonstantin looks great! Thanks so much!

John1231983 commented 6 years ago

@maksimovkonstantin: Me too, I also got 0.41 LB with left/right and up/down flips and Adam optimization. One more thing: do you use the fixed dataset (made by Konstantin Lopuhin) to obtain 0.413 LB? @keven4ever: What optimizer are you using? I am using Adam with 80 epochs on all layers:

model.train(dataset_train, dataset_val,
            learning_rate=1e-4,
            epochs=80,
            layers='all')
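
In case it matters: as far as I can tell, the repo's MaskRCNN.compile() builds an SGD optimizer, so using Adam means swapping that one line. A rough sketch (the SGD defaults shown are approximate and may differ between versions):

# Inside MaskRCNN.compile() in model.py
import keras

# optimizer = keras.optimizers.SGD(lr=learning_rate, momentum=momentum, clipnorm=5.0)
optimizer = keras.optimizers.Adam(lr=learning_rate, clipnorm=5.0)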
keven4ever commented 6 years ago

@John1231983 I still use SGD as it is the one used in the paper. @John1231983 are you able to avoid overfitting when training all layers with only flipping augmentation? That is exactly what I did; the only difference is the optimizer.

John1231983 commented 6 years ago

I think I did not have overfitting. See my training log: [screenshot of training log]

This is my training schedule with Adam:

LEARNING_RATE=1e-4
model.train(dataset_train, dataset_val,
            learning_rate=LEARNING_RATE,
            epochs=40,
            layers='all')
model.train(dataset_train, dataset_val, 
            learning_rate=LEARNING_RATE/10,
            epochs=80, 
            layers="all")

model.train(dataset_train, dataset_val,
            learning_rate=LEARNING_RATE/100,
            epochs=120,
            layers='all')

With the above I got 0.41 LB on the fixed dataset using ResNet-50. Could you tell me what base score you achieved? By base score I mean using the original Mask R-CNN implementation.

keven4ever commented 6 years ago

@John1231983 my base score is 0.448 but, as I mentioned, it is hard to reproduce. However, I also managed to achieve 0.44+ several times without training the whole network; of course I tuned several parameters as mentioned in the other thread.

John1231983 commented 6 years ago

Great. I guess I missed some parameters. So you just changed hyper-parameters and achieved 0.44+, am I right? Do you train the network with different training inputs, such as gray input for one network and color input for another network? This is my hyper-parameter setting. How about you?

    USE_MINI_MASK = True
    MINI_MASK_SHAPE = (56, 56)
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2
    bs = GPU_COUNT * IMAGES_PER_GPU
    STEPS_PER_EPOCH = 600 // bs
    VALIDATION_STEPS = 70 // bs
    NUM_CLASSES = 1 + 1 
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    IMAGE_PADDING = True 
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)  
    BACKBONE_STRIDES = [4, 8, 16, 32, 64]
    RPN_TRAIN_ANCHORS_PER_IMAGE = 320 #300
    POST_NMS_ROIS_TRAINING = 2000
    POST_NMS_ROIS_INFERENCE = 2000
    POOL_SIZE = 7
    MASK_POOL_SIZE = 14
    MASK_SHAPE = [28, 28]
    TRAIN_ROIS_PER_IMAGE = 512
    RPN_NMS_THRESHOLD = 0.7
    MAX_GT_INSTANCES = 256
    DETECTION_MAX_INSTANCES = 400 
    DETECTION_MIN_CONFIDENCE = 0.7
    DETECTION_NMS_THRESHOLD = 0.3    
    MEAN_PIXEL = np.array([42.17746161,38.21568456,46.82167803])
    WEIGHT_DECAY = 0.0001
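
In case it helps, this is roughly how I wire such a config up and load the COCO weights (a sketch; NucleusConfig, COCO_MODEL_PATH and MODEL_DIR are placeholder names, and the exclude list follows the repo's own training samples):

import model as modellib  # newer checkouts: from mrcnn import model as modellib

config = NucleusConfig()  # a Config subclass holding the settings above
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)

# Skip the head layers whose shapes depend on NUM_CLASSES, since this dataset
# has 1 + 1 classes instead of COCO's 81.
model.load_weights(COCO_MODEL_PATH, by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])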
keven4ever commented 6 years ago

@John1231983 correct! I think increasing TRAIN_ROIS_PER_IMAGE to 512 helped boost the performance a lot; before that I got around 0.414. Also, I use the original images instead of gray input.

John1231983 commented 6 years ago

I think you can boost the score further with this scheme: cluster the training set into 3 sets and train Mask R-CNN on each, so you obtain 3 checkpoints. After that, apply each checkpoint to the corresponding cluster in the test set.

maksimovkonstantin commented 6 years ago

@John1231983 do you use augmentation, or do you get 0.44 with your above config on clean images?

John1231983 commented 6 years ago

@maksimovkonstantin: I just use simple augmentation: left/right and up/down flips. I will try your augmentation. Thanks again. With the above settings I got 0.41. Only @keven4ever achieved 0.44, not me :(

keven4ever commented 6 years ago

@John1231983 I tried the three-class approach (white, black and purple), but only in a single model; it did not get as high as 0.448, maybe 0.43 or 0.44+, so no gain. I will try your approach after I manage to get the whole network trained.

keven4ever commented 6 years ago

@maksimovkonstantin Actually I got 0.448 with only flip l/r augmentation.

John1231983 commented 6 years ago

@maksimovkonstantin: I think your code is somehow wrong, because you have to rotate/flip both the image and its masks/boxes. Your code only augments the image.

keven4ever commented 6 years ago

@maksimovkonstantin @John1231983 I am still not fully convinced by zoom- and crop-based augmentation. For example, if we always crop a 128x128 patch from the original image, then to use Mask R-CNN we still need to scale it up to something like 512x512; this will always increase the size of the cells during training. Will the model then fail to predict small cells?

John1231983 commented 6 years ago

@keven4ever: Cropping is only for making the dataset larger. Actually, for semantic segmentation we do not need to resize to a fixed size like 512x512, so there it may improve performance. For Mask R-CNN we have to use a fixed input like 512x512 or 1024x1024, so I guess it will not improve performance because we add a lot of zero padding to the image.

keven4ever commented 6 years ago

@FruVirus I tried ResNet-50 and trained everything from scratch. With data augmentation there is no overfitting problem any more; however, the mAP is still much worse than training only the heads with ResNet-101 (pre-loaded COCO weights). I think the pre-loaded weights make quite a lot of difference (I only have a single GTX 1080; it took two days to train ResNet-50).

John1231983 commented 6 years ago

@keven4ever: if I understand correctly, you only trained the 'heads' from the COCO weights, and did not train 'all' to achieve the 0.43+ score. Am I right? If so, I guess you may need to train all layers once you see the overfitting, e.g. train all after 20 epochs.

keven4ever commented 6 years ago

@John1231983 that's correct!

John1231983 commented 6 years ago

Thanks. Could you provide your LB score using ResNet-50 trained from scratch? I achieved 0.41 with ResNet-50 and the ImageNet pretrain, training all layers and skipping the heads-only stage.

paulcx commented 6 years ago

Hey guys, does anyone know how to add focal loss?

keven4ever commented 6 years ago

@John1231983 I only got 0.376. Btw, where did you download the pre-trained ImageNet ResNet-50 weights?

John1231983 commented 6 years ago

@keven4ever: Too low. I got 0.41 with it. Now I am using the COCO pretrain and hope it is better.

FYI, this is the link to download pre-trained models (ResNet, Inception, ...), but I used them and they gave worse results than resnet50: https://github.com/fchollet/deep-learning-models/releases

This is my learning schedule. Do you use the same as me?

model.train(dataset_train, dataset_val,
            learning_rate=bowl_config.LEARNING_RATE/10,
            epochs=10,
            layers="heads")

model.train(dataset_train, dataset_val,
            learning_rate=bowl_config.LEARNING_RATE / 10,
            epochs=40,
            layers="all")
model.train(dataset_train, dataset_val,
            learning_rate=bowl_config.LEARNING_RATE / 100,
            epochs=80,
            layers="all")
maksimovkonstantin commented 6 years ago

@John1231983 @keven4ever I trained with SGD at 0.001 for 100 epochs on heads and 60 epochs on 4+, using pretrained COCO weights and the ResNet-101 backbone - it gives around a 0.435 score. I think the key to success is to train all layers only in the very last epochs.

John1231983 commented 6 years ago

@maksimovkonstantin: Funny - I have changed many settings trying to find a way to do better, but it looks like the default strategy gives better performance. To summarize, could you confirm that your strategy is like this?

model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=100,
                    layers='heads')

# Training - Stage 2
# Finetune layers from ResNet stage 4 and up
print("Fine tune Resnet stage 4 and up")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=60,
            layers='4+')

# Training - Stage 3
# Fine tune all layers
print("Fine tune all layers")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=10,
            layers='all')
maksimovkonstantin commented 6 years ago

@John1231983 exactly!) with the config below:

class BowlConfig(Config):
    NAME = "nucleos"
    GPU_COUNT = 2
    IMAGES_PER_GPU = 1

    NUM_CLASSES = 1 + 1  # background + 1 area

    IMAGE_MIN_DIM = 256
    IMAGE_MAX_DIM = 512
    IMAGE_PADDING = True
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)  # anchor side in pixels

    TRAIN_ROIS_PER_IMAGE = 1024

    ROI_POSITIVE_RATIO = 0.33

    STEPS_PER_EPOCH = 550 // (IMAGES_PER_GPU * GPU_COUNT)

    VALIDATION_STEPS = 50 // (IMAGES_PER_GPU * GPU_COUNT)

    MEAN_PIXEL = [43.53, 39.56, 48.22]

    LEARNING_RATE = 1e-3

    USE_MINI_MASK = True
    MAX_GT_INSTANCES = 500

keven4ever commented 6 years ago

@maksimovkonstantin first of all, thank you for sharing this interesting training scheme. The purpose of this competition is to get my hands dirty and gain some experience, and I have to say that what you shared served this purpose for me. Thank you again!

@maksimovkonstantin @John1231983 You have shared different training schemes and parameters; I wonder whether your configurations/schemes are reproducible. The reason I ask is that, after getting my best LB score, I tried to train again, either by continuing from the last epoch or starting from epoch 0, and was never able to get similar performance again. This also happened with some other configurations I had. I also tried different things which in theory should improve performance, but in practice they just produced worse scores.

But I only tried each thing once, so maybe if I trained multiple times I would eventually get a better score. This makes me think that with such a complicated network and so many hyper-parameters, the results may not be very reproducible. If that is the case, instead of trying each parameter and training scheme just once, we should stick to the configuration we believe in and try it several times. What do you guys think?

maksimovkonstantin commented 6 years ago

@keven4ever I also have the same issue with reproducibility, but I hope that my last scheme will be more stable.

John1231983 commented 6 years ago

One more thing I want to share: convert the images (color and gray) to the same space, e.g. grayscale. Then, after obtaining the result at inference time, you can consider post-processing, which gave me some gain. I think the challenge has many problems that deep learning may not handle, i.e. different image spaces...
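
A minimal sketch of what I mean by converting everything to one space (grayscale kept as 3 channels so the input shape the network expects is unchanged):

import cv2

def to_gray_space(image):
    # Collapse the color information, then replicate the single channel back
    # to 3 channels so the image still matches the RGB input shape.
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2RGB)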

John1231983 commented 6 years ago

@maksimovkonstantin: In your data_augmentation function you augment the image data with random rotation, zoom... How about its masks? The same random numbers (scale, angle) must be applied to the masks for consistency.

maksimovkonstantin commented 6 years ago

@John1231983 it augments both the masks and the image; the function takes a list of images as input, where the first element is the image and the others are the masks.

John1231983 commented 6 years ago

@maksimovkonstantin: Great to hear that. However, I used this function and it gives an error. This is my script:

image=dataset_train.load_image(0)
masks, class_ids = dataset_train.load_mask(0)
#Image shape of (256, 320, 3) and masks shape of (256, 320, 73)
input_aug=data_augmentation([image, masks])

This is the error:

    input_aug=data_augmentation([image, masks])
  File "augmentation_data.py", line 46, in data_augmentation
    output_images[i] = cv2.warpAffine(input_image, M, (input_image.shape[1], input_image.shape[0]))
cv2.error: /io/opencv/modules/imgproc/src/imgwarp.cpp:1825: error: (-215) ifunc != 0 in function remap

This is my opencv-python version:

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> cv2.__version__
'3.3.1'
maksimovkonstantin commented 6 years ago

@John1231983 the list should contain images of shape (256, 320, 3); you should unpack the stacked masks into 73 mask images with 3 equal channels each.
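
Something like this is what I mean (a rough sketch, untested, using the data_augmentation function above):

import numpy as np

image = dataset_train.load_image(0)            # (H, W, 3)
masks, class_ids = dataset_train.load_mask(0)  # (H, W, num_instances)

# Unpack the stacked masks into a plain list of (H, W, 3) uint8 arrays, so
# every element passed to data_augmentation is image-like, not a 4D stack.
mask_list = [np.stack([masks[:, :, i]] * 3, axis=-1).astype(np.uint8)
             for i in range(masks.shape[2])]

augmented = data_augmentation([image] + mask_list)
aug_image, aug_masks = augmented[0], augmented[1:]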

John1231983 commented 6 years ago

@maksimovkonstantin: I have tried that, but it still errors. This is the shape of the masks after I converted them:

masks_rgb_all=[]
for i in range(masks.shape[2]):
    mask=masks[:,:,i]
    masks_rgb = []
    for i in range (3):
        masks_rgb.append(mask)
    masks_rgb = np.stack(masks_rgb, axis=-1)
    masks_rgb_all.append(masks_rgb)
masks_rgb_all = np.stack(masks_rgb_all, axis=-1)
print (masks_rgb.shape,masks_rgb_all.shape)

input_aug=data_augmentation([image,masks_rgb_all])

(256, 320, 3) (256, 320, 3, 73), but the error persists:

 output_images[i] = cv2.warpAffine(input_image, M, (input_image.shape[1], input_image.shape[0]))
cv2.error: /io/opencv/modules/imgproc/src/imgwarp.cpp:1825: error: (-215) ifunc != 0 in function remap
keven4ever commented 6 years ago

@John1231983 @maksimovkonstantin I can confirm that the training schema of starting with the heads and then training all layers does improve performance. You can see the performance figure below: the lowest red line is the run where I got my highest LB score (0.448), the upper red line is training only the heads, and the green line shows when I train all layers after epoch 84.

The difference from my best run is that this time I used more augmentation (flipping l/r, flipping u/d, rotating 360 degrees, brightness, but still no zooming or cropping), but it seems the extra augmentation makes the model underfit a little.

Also, I only used SGD with lr 0.001. According to this: https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/, SGD can usually find a better local optimum than adaptive optimizers like Adam.

[screenshot: training/validation loss curves for the three runs]
maksimovkonstantin commented 6 years ago

@keven4ever I have very similar loss charts, but I can't get the mask loss close to 0.1 like you; I think the config is the key.

keven4ever commented 6 years ago

@maksimovkonstantin I am not sure the config is the key, since the only difference here is data augmentation: the best-performing run only used flip l/r augmentation and only trained the heads, and the config is the same. So I am totally confused; in theory both augmentation and training the entire network should improve performance, not reduce it.

John1231983 commented 6 years ago

@keven4ever: As far as I know, we are working at the pixel level, so scaling masks must be done carefully. In my experiments (I did not try augmentation), post-processing is the most important thing in this challenge.

John1231983 commented 6 years ago

@keven4ever and @maksimovkonstantin: After training on the dataset many times, I found the best ways to achieve 0.44+ are:

  1. Use the COCO pretrain.
  2. Train the heads first, then train all layers; the number of heads-training epochs should be bigger than the all-layers epochs.
  3. Do not apply complex data augmentation; flipud and fliplr are enough.
  4. Use SGD with clipnorm. Adam is faster but, as @keven4ever mentioned, it has difficulty reaching a good local optimum.
  5. Splitting the dataset into clusters like gray, color, HSV... does not help improve performance. Just train the network on all types together.
  6. Post-processing like dilation, CRF... is important.

Do you agree with these points? What is your performance now, @keven4ever? I hope you can reproduce the LB with my tips above.

keven4ever commented 6 years ago

@John1231983 based on my experiments it looks correct; however, some of these don't really make sense to me. I suspect there is something special either in the Mask R-CNN implementation or in the dataset, for example:

  1. why other data augmentation like rotation and brightness doesn't help performance
  2. why adding more mask classes doesn't improve the performance

I am not sure about the last bullet; I have not tried post-processing like dilation or CRF. The only post-processing I did was to clean up the mask overlaps, otherwise there is a submission error.

John1231983 commented 6 years ago

@keven4ever: I think the baseline Mask R-CNN from this repo achieves around 0.4+ LB. The performance also depends on the learning strategy. What is your LB using COCO weights and heads-then-all training?

For me, dilation (post-processing) improved my score from 0.4 to 0.43 LB. It is still lower than the baseline of the Mask R-CNN PyTorch implementation (0.5+ LB).
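
For reference, the dilation step can be as simple as growing each predicted instance mask by a pixel or two before encoding the submission. A sketch with OpenCV (the kernel size and iteration count are values I tuned, not fixed):

import cv2
import numpy as np

def dilate_masks(masks, kernel_size=3, iterations=1):
    # masks: (H, W, num_instances) array as returned by
    # model.detect()[0]["masks"] in this repo.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = np.zeros_like(masks, dtype=np.uint8)
    for i in range(masks.shape[-1]):
        dilated[:, :, i] = cv2.dilate(masks[:, :, i].astype(np.uint8),
                                      kernel, iterations=iterations)
    return dilated.astype(bool)

Overlaps introduced by the dilation still need to be removed before submission, as @keven4ever mentioned.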