matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Validation loss keeps fluctuating #2545

Open MahBadran93 opened 3 years ago

MahBadran93 commented 3 years ago

Hi all,

I am using this Mask R-CNN library to do detection and segmentation. I have this class distribution: Class_Occurrences = {0: 189, 1: 22, 2: 1, 3: 40, 4: 28, 5: 85, 6: 40, 7: 63, 8: 42, 9: 5} (key: class_id, value: number of occurrences). The first class, with key 0, is the background.

The dataset contains 189 training images and 53 validation images.

  1. Training process 1: 100 epochs, pre-trained COCO weights, without augmentation. Resulting mAP: 0.17
  2. Training process 2: 100 epochs, pre-trained COCO weights, with online augmentation. Resulting mAP: 0.29. Augmentation config: augmentation = iaa.SomeOf((0, 3), [iaa.Fliplr(0.5), iaa.Flipud(0.5), iaa.OneOf([iaa.Affine(rotate=90), iaa.Affine(rotate=180), iaa.Affine(rotate=270)]), iaa.Multiply((0.8, 1.5)), iaa.GaussianBlur(sigma=(0.0, 5.0))]) (also shown as a code block below)
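For readability, here is the same augmentation pipeline as a code block, roughly as it is passed to model.train() via its augmentation argument (a sketch):

```python
import imgaug.augmenters as iaa

# Online augmentation: apply between 0 and 3 of the listed transforms per image.
augmentation = iaa.SomeOf((0, 3), [
    iaa.Fliplr(0.5),                      # horizontal flip with 50% probability
    iaa.Flipud(0.5),                      # vertical flip with 50% probability
    iaa.OneOf([iaa.Affine(rotate=90),     # rotate by exactly 90, 180 or 270 degrees
               iaa.Affine(rotate=180),
               iaa.Affine(rotate=270)]),
    iaa.Multiply((0.8, 1.5)),             # brightness change
    iaa.GaussianBlur(sigma=(0.0, 5.0)),   # blur
])
# This object is then passed to MaskRCNN.train(...) through its `augmentation` argument.
```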
Below you can see the training and validation losses for process 2:

[Training loss and validation loss plots (six curve pairs) - images not reproduced here]

My questions are: why is the mAP so low? What can I do to increase performance? And why does the training loss decrease while the validation loss does not (it keeps fluctuating)? I tried to add class_weight to work around the data imbalance, but I always get this error: Unknown entries in class_weight dictionary: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Only expected following keys: []

Model Configuration:

Name Value
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.9
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 22
IMAGE_MIN_DIM 800
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME object
NUM_CLASSES 10
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 100
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 100
WEIGHT_DECAY 0.0001
TimNagle-McNaughton commented 3 years ago
  1. It seems pretty obvious to me that your model is immediately overfitting: your validation loss is almost double your training loss right from the start. I suspect the learning rate is too high and would try reducing it. I recommend this blog.
  2. mAP will vary with your confidence threshold and IoU. Try reducing the threshold and visualize some results to see if that looks better.
  3. Your validation loss is varying wildly because your validation set is likely not representative of the whole dataset. I would recommend shuffling/resampling the validation set, or using a larger validation fraction. (A rough sketch of these adjustments follows below.)
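For example, roughly along these lines (a sketch only; the class names, the 0.0001 learning rate, the 0.5 threshold, the 30% validation fraction and the all_image_ids list are illustrative placeholders, not repo defaults):

```python
import random
from mrcnn.config import Config

# 1. Lower the learning rate for training.
class LowLrConfig(Config):
    NAME = "object"
    NUM_CLASSES = 10            # background + 9 classes, as in the config above
    LEARNING_RATE = 0.0001      # down from 0.001

# 2. Lower the detection confidence threshold when evaluating mAP.
class EvalConfig(LowLrConfig):
    DETECTION_MIN_CONFIDENCE = 0.5   # down from 0.9

# 3. Re-draw a larger, shuffled validation split from the full image list.
image_ids = list(all_image_ids)      # placeholder: ids of all annotated images
random.seed(42)
random.shuffle(image_ids)
n_val = int(0.3 * len(image_ids))    # larger validation fraction than 53/242
val_ids, train_ids = image_ids[:n_val], image_ids[n_val:]
```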
will6309 commented 3 years ago

Hi MahBadran93,

  1. It shows some overfitting: if you draw a line of best fit through the val loss, it goes down and then back up, while your train loss keeps going down.
  2. It also shows signs that the training dataset may not be representative enough, so the model didn't learn enough to perform the task. Make sure that you feed the right images to your model.
mansi-aggarwal-2504 commented 3 years ago

I have a question. Does Mask R-CNN not adjust its weights based on the validation dataset after each epoch? I have a dataset divided into train, val and test. Train and val are supplied for training. Yet if I run the model on the validation dataset, the results are quite poor, let alone on the test dataset. Does this mean the validation dataset is not used for training, and is just there for us to check our val score while training is going on?

TimNagle-McNaughton commented 3 years ago

The validation set is used to validate training.

After each epoch, the current model is evaluated on the validation set. This check tells you whether the last round of training improved the model or not. So the validation set is not explicitly used to train the model, but it is used during training, if that makes sense.
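Concretely, in this repo the validation dataset passed to model.train() only produces the val_* losses logged at the end of every epoch; no gradients are computed from it. A minimal sketch (model, config, dataset_train and dataset_val are placeholders, not defined here):

```python
# VALIDATION_STEPS in the config controls how many validation batches
# are evaluated at the end of each epoch.
model.train(dataset_train, dataset_val,          # dataset_val is only evaluated
            learning_rate=config.LEARNING_RATE,
            epochs=100,
            layers='heads')
# During training, Keras logs loss/rpn_class_loss/... from dataset_train
# and val_loss/val_rpn_class_loss/... from dataset_val; weight updates
# come only from the training batches.
```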

mansi-aggarwal-2504 commented 3 years ago

@TimNagle-McNaughton, thank you for your reply. So if my validation score is not improving, does the training model learn that and adjust its weights? That would mean it learns from both the train and val datasets, and if that is so, the resulting model should not perform that poorly on the val dataset. Am I getting it correctly? My validation score stops improving after about 40 epochs, and the trained model is unable to segment most of the objects in the validation/test datasets. Any ideas on how to improve training?

I tried something. I wanted to retrain all layers of the backbone network on my custom dataset, so I set TRAIN_BN = True in config.py. Am I correct here? Will this mean no layer is frozen during training? (What I changed is sketched below.)
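Roughly this, as a sketch (I am not sure TRAIN_BN is the right switch, since which layers are frozen also seems to be controlled by the layers argument of model.train; the class name and epoch count are mine, and model/config/dataset objects are placeholders):

```python
from mrcnn.config import Config

class MyConfig(Config):
    NAME = "my_dataset"
    TRAIN_BN = True        # was False: also train the batch-normalization layers

# The `layers` argument selects which layers are trainable:
# 'heads' trains only the new head layers, 'all' also unfreezes the backbone.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='all')   # retrain all layers, not just the heads
```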

TimNagle-McNaughton commented 3 years ago

> So if my validation score is not improving, does the training model learn that and adjust its weights?

Broadly, yes.

> the resultant model should not perform that poorly on val dataset

Correct.

> For which I set TRAIN_BN = True

I'm not familiar with that flag, sorry.

mansi-aggarwal-2504 commented 3 years ago

> the resultant model should not perform that poorly on val dataset
>
> Correct.

I guess my trained model is not efficient then, because it is in fact performing poorly on the val set. Thanks anyway @TimNagle-McNaughton

MahBadran93 commented 3 years ago
> 1. It seems pretty obvious to me that your model is immediately overfitting: your validation loss is almost double your training loss right from the start. I suspect the learning rate is too high and would try reducing it. I recommend this blog.
> 2. mAP will vary with your confidence threshold and IoU. Try reducing the threshold and visualize some results to see if that looks better.
> 3. Your validation loss is varying wildly because your validation set is likely not representative of the whole dataset. I would recommend shuffling/resampling the validation set, or using a larger validation fraction.

Thank you @TimNagle-McNaughton for your answer.

MahBadran93 commented 3 years ago

> Hi MahBadran93,
>
> 1. It shows some overfitting: if you draw a line of best fit through the val loss, it goes down and then back up, while your train loss keeps going down.
> 2. It also shows signs that the training dataset may not be representative enough, so the model didn't learn enough to perform the task. Make sure that you feed the right images to your model.

You are right, the dataset was not representative enough and that was the main issue.

raulperezalejo commented 2 years ago

Hello, I am facing the same problem. Based on the previous answers I have adjusted my data split: I used 80-20 (the original split) and also tried 90-10 and 70-30, but I get the same result: epoch_loss looks great while validation_loss keeps fluctuating. I am only training the heads, and it fluctuates no matter how many epochs I train. I have read elsewhere that a possible cause is a model that is too complex, but I don't think that argument applies here.

This is the dataset I am using: https://github.com/dsmlr/Car-Parts-Segmentation/

I'd appreciate any advice on where to continue looking. (My re-split is sketched below, followed by my config.)
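The re-split looks roughly like this (a sketch; the annotation path is a placeholder for wherever the COCO-style annotations of that dataset live on disk):

```python
import json
import random

# Load the COCO-style annotation file (placeholder path, adjust to your checkout).
with open("path/to/annotations.json") as f:
    coco = json.load(f)

image_ids = [img["id"] for img in coco["images"]]
random.seed(0)
random.shuffle(image_ids)

val_fraction = 0.2                        # 80-20; I also tried 0.1 and 0.3
n_val = int(val_fraction * len(image_ids))
val_ids = set(image_ids[:n_val])
train_ids = set(image_ids[n_val:])
```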

BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 35
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 512
IMAGE_META_SIZE 32
IMAGE_MIN_DIM 512
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [512 512 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME car_parts
NUM_CLASSES 20
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 500
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 100
WEIGHT_DECAY 0.0001

UPDATE: It was fluctuating because my dataset already contains a background annotation. When creating my custom Dataset this produced two background classes, which caused problems during training. Now my training no longer fluctuates. (The fix is sketched below.)
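In case someone hits the same thing: utils.Dataset already registers background as class 0, so a dataset whose annotation file also ships a background category ends up with it twice. My fix was roughly this (a sketch; the class/method names and the COCO-style category dicts are assumptions about my own loader, not repo code):

```python
from mrcnn import utils

class CarPartsDataset(utils.Dataset):
    def load_car_parts(self, categories):
        """categories: COCO-style list of {'id': ..., 'name': ...} dicts (assumed format)."""
        # utils.Dataset already registers ("", 0, "BG") as class 0,
        # so skip any background category coming from the annotation file
        # to avoid registering the background a second time.
        for cat in categories:
            if cat["name"].lower() in ("background", "bg"):
                continue
            self.add_class("car_parts", cat["id"], cat["name"])
```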

jjavv commented 1 year ago

I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on?

Network 1: [loss plot, accuracy plot - images not reproduced]

Network 2: [loss plot, accuracy plot - images not reproduced]

Savant-HO commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on?
>
> Network 1 / Network 2 loss and accuracy plots

Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!

jjavv commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!

I couldn't come to any conclusion.

Savant-HO commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.

If you solve it one day, please tell me! Thank you!

MahBadran93 commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.

You need to solve the data imbalance problem; it can be the main reason for the bad results. Make sure each class is represented roughly equally across the train, val and test splits. You can also try augmentation. (A quick way to check the per-split class distribution is sketched below.)
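Something like this can be used to check the distribution (a sketch; dataset_train and dataset_val are assumed to be prepared matterport utils.Dataset instances, and load_mask is the standard per-image mask loader of that class):

```python
from collections import Counter

def class_distribution(dataset):
    """Count how often each class id occurs across all images of a Dataset."""
    counts = Counter()
    for image_id in dataset.image_ids:
        _, class_ids = dataset.load_mask(image_id)   # standard Dataset API
        counts.update(class_ids.tolist())
    return counts

# The rare classes (e.g. the one with a single instance) should not end up
# exclusively in one split.
print(class_distribution(dataset_train))
print(class_distribution(dataset_val))
```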

jjavv commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.
>
> You need to solve the data imbalance problem; it can be the main reason for the bad results. Make sure each class is represented roughly equally across the train, val and test splits. You can also try augmentation.

I tried data augmentation, but a pretrained AlexNet still skipped some classes in the classification report and the accuracy is very low. On MNIST it gave 98%, but on my ECG dataset it was 48%, and the classification report shows a precision/recall of 0 for a few classes.

2022kaishi commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!

Hi guys, I'm facing the same issue. Here is my advice:

  1. Check your dataset. The same preprocessing (augmentation, rescaling, etc.) should be applied to all splits.
  2. Use the callback API in Keras to keep reducing the learning rate. This method helped me out (see the sketch below).
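For example, something along these lines (a sketch; it assumes your copy of model.train accepts the custom_callbacks argument, which recent versions of this repo do, and model/config/dataset objects are placeholders):

```python
from keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever the validation loss stops improving.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.5,           # multiply the learning rate by 0.5
    patience=3,           # after 3 epochs without improvement
    min_lr=1e-6,
    verbose=1,
)

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=100,
            layers='heads',
            custom_callbacks=[reduce_lr])
```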

I hope it was helpful.