matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Validation loss keeps fluctuating #2545

Open MahBadran93 opened 3 years ago

MahBadran93 commented 3 years ago

Hi all,

I am using this Mask R-CNN library to do detection and segmentation. I have this class distribution: Class_Occurrences = {0: 189, 1: 22, 2: 1, 3: 40, 4: 28, 5: 85, 6: 40, 7: 63, 8: 42, 9: 5} (key: class_id, value: number of occurrences). The first class, with key 0, is the background.

The dataset contains 189 training images and 53 validation images.

  1. Training process 1: 100 epochs, pre-trained COCO weights, without augmentation. Resulting mAP: 0.17
  2. Training process 2: 100 epochs, pre-trained COCO weights, with online augmentation. Resulting mAP: 0.29. Augmentation config: augmentation = iaa.SomeOf((0, 3), [iaa.Fliplr(0.5), iaa.Flipud(0.5), iaa.OneOf([iaa.Affine(rotate=90), iaa.Affine(rotate=180), iaa.Affine(rotate=270)]), iaa.Multiply((0.8, 1.5)), iaa.GaussianBlur(sigma=(0.0, 5.0))]) (also shown as a code block below)
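For readability, here is the same augmentation pipeline as a code block, roughly as it is passed to model.train() via its augmentation argument (a sketch):

```python
import imgaug.augmenters as iaa

# Online augmentation: apply between 0 and 3 of the listed transforms per image.
augmentation = iaa.SomeOf((0, 3), [
    iaa.Fliplr(0.5),                      # horizontal flip with 50% probability
    iaa.Flipud(0.5),                      # vertical flip with 50% probability
    iaa.OneOf([iaa.Affine(rotate=90),     # rotate by exactly 90, 180 or 270 degrees
               iaa.Affine(rotate=180),
               iaa.Affine(rotate=270)]),
    iaa.Multiply((0.8, 1.5)),             # brightness change
    iaa.GaussianBlur(sigma=(0.0, 5.0)),   # blur
])
# This object is then passed to MaskRCNN.train(...) through its `augmentation` argument.
```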
Below you can see the training and validation losses for process 2:

[Training loss and validation loss plots (six curve pairs) - images not reproduced here]

My questions are: why is the mAP so low? What can I do to increase performance? And why does the training loss decrease while the validation loss does not (it keeps fluctuating)? I tried to add class_weight to work around the data imbalance, but I always get this error: Unknown entries in class_weight dictionary: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Only expected following keys: []

Model Configuration:

Name Value
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.9
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 22
IMAGE_MIN_DIM 800
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME object
NUM_CLASSES 10
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 100
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 100
WEIGHT_DECAY 0.0001
TimNagle-McNaughton commented 3 years ago
  1. It seems pretty obvious to me that your model is immediately overfitting: your validation loss is almost double your training loss right from the start. I suspect the learning rate is too high and would try reducing it. I recommend this blog.
  2. mAP will vary with your confidence threshold and IoU. Try reducing the threshold and visualize some results to see if that looks better.
  3. Your validation loss is varying wildly because your validation set is likely not representative of the whole dataset. I would recommend shuffling/resampling the validation set, or using a larger validation fraction. (A rough sketch of these adjustments follows below.)
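For example, roughly along these lines (a sketch only; the class names, the 0.0001 learning rate, the 0.5 threshold, the 30% validation fraction and the all_image_ids list are illustrative placeholders, not repo defaults):

```python
import random
from mrcnn.config import Config

# 1. Lower the learning rate for training.
class LowLrConfig(Config):
    NAME = "object"
    NUM_CLASSES = 10            # background + 9 classes, as in the config above
    LEARNING_RATE = 0.0001      # down from 0.001

# 2. Lower the detection confidence threshold when evaluating mAP.
class EvalConfig(LowLrConfig):
    DETECTION_MIN_CONFIDENCE = 0.5   # down from 0.9

# 3. Re-draw a larger, shuffled validation split from the full image list.
image_ids = list(all_image_ids)      # placeholder: ids of all annotated images
random.seed(42)
random.shuffle(image_ids)
n_val = int(0.3 * len(image_ids))    # larger validation fraction than 53/242
val_ids, train_ids = image_ids[:n_val], image_ids[n_val:]
```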
will6309 commented 3 years ago

Hi MahBadran93,

  1. It shows some overfitting: if you draw a line of best fit through the val loss, it goes down and then back up, while your train loss keeps going down.
  2. It also shows signs that the training dataset may not be representative enough, so the model didn't learn enough to perform the task. Make sure that you feed the right images to your model.
mansi-aggarwal-2504 commented 3 years ago

I have a question. Does Mask R-CNN not adjust its weights based on the validation dataset after each epoch? I have a dataset divided into train, val and test. Train and val are supplied for training. Yet if I run the model on the validation dataset, the results are quite poor, let alone on the test dataset. Does this mean the validation dataset is not used for training, and is just there for us to check our val score while training is going on?

TimNagle-McNaughton commented 3 years ago

The validation set is used to validate training.

After each epoch, the current model is evaluated on the validation set. This check tells you whether the last round of training improved the model or not. So the validation set is not explicitly used to train the model, but it is used during training, if that makes sense.
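Concretely, in this repo the validation dataset passed to model.train() only produces the val_* losses logged at the end of every epoch; no gradients are computed from it. A minimal sketch (model, config, dataset_train and dataset_val are placeholders, not defined here):

```python
# VALIDATION_STEPS in the config controls how many validation batches
# are evaluated at the end of each epoch.
model.train(dataset_train, dataset_val,          # dataset_val is only evaluated
            learning_rate=config.LEARNING_RATE,
            epochs=100,
            layers='heads')
# During training, Keras logs loss/rpn_class_loss/... from dataset_train
# and val_loss/val_rpn_class_loss/... from dataset_val; weight updates
# come only from the training batches.
```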

mansi-aggarwal-2504 commented 3 years ago

@TimNagle-McNaughton, thank you for your reply. So if my validation score is not improving, does the training model learn that and adjust its weights? That would mean it learns from both the train and val datasets, and if that is so, the resulting model should not perform that poorly on the val dataset. Am I getting it correctly? My validation score stops improving after about 40 epochs, and the trained model is unable to segment most of the objects in the validation/test datasets. Any ideas on how to improve training?

I tried something. I wanted to retrain all layers of the backbone network on my custom dataset, so I set TRAIN_BN = True in config.py. Am I correct here? Will this mean no layer is frozen during training? (What I changed is sketched below.)
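Roughly this, as a sketch (I am not sure TRAIN_BN is the right switch, since which layers are frozen also seems to be controlled by the layers argument of model.train; the class name and epoch count are mine, and model/config/dataset objects are placeholders):

```python
from mrcnn.config import Config

class MyConfig(Config):
    NAME = "my_dataset"
    TRAIN_BN = True        # was False: also train the batch-normalization layers

# The `layers` argument selects which layers are trainable:
# 'heads' trains only the new head layers, 'all' also unfreezes the backbone.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='all')   # retrain all layers, not just the heads
```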

TimNagle-McNaughton commented 3 years ago

> So if my validation score is not improving, does the training model learn that and adjust its weights?

Broadly, yes.

> the resultant model should not perform that poorly on val dataset

Correct.

> For which I set TRAIN_BN = True

I'm not familiar with that flag, sorry.

mansi-aggarwal-2504 commented 3 years ago

> the resultant model should not perform that poorly on val dataset
>
> Correct.

I guess my trained model is not efficient then, because it is in fact performing poorly on the val set. Thanks anyway @TimNagle-McNaughton

MahBadran93 commented 3 years ago
> 1. It seems pretty obvious to me that your model is immediately overfitting: your validation loss is almost double your training loss right from the start. I suspect the learning rate is too high and would try reducing it. I recommend this blog.
> 2. mAP will vary with your confidence threshold and IoU. Try reducing the threshold and visualize some results to see if that looks better.
> 3. Your validation loss is varying wildly because your validation set is likely not representative of the whole dataset. I would recommend shuffling/resampling the validation set, or using a larger validation fraction.

Thank you @TimNagle-McNaughton for your answer.

MahBadran93 commented 3 years ago

> Hi MahBadran93,
>
> 1. It shows some overfitting: if you draw a line of best fit through the val loss, it goes down and then back up, while your train loss keeps going down.
> 2. It also shows signs that the training dataset may not be representative enough, so the model didn't learn enough to perform the task. Make sure that you feed the right images to your model.

You are right, the dataset was not representative enough and that was the main issue.

raulperezalejo commented 2 years ago

Hello, I am facing the same problem. Based on the previous answers I have adjusted my data split: I used 80-20 (the original split) and also tried 90-10 and 70-30, but I get the same result: epoch_loss looks great while validation_loss keeps fluctuating. I am only training the heads, and it fluctuates no matter how many epochs I train. I have read elsewhere that a possible cause is a model that is too complex, but I don't think that argument applies here.

This is the dataset I am using: https://github.com/dsmlr/Car-Parts-Segmentation/

I'd appreciate any advice on where to continue looking. (My re-split is sketched below, followed by my config.)
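The re-split looks roughly like this (a sketch; the annotation path is a placeholder for wherever the COCO-style annotations of that dataset live on disk):

```python
import json
import random

# Load the COCO-style annotation file (placeholder path, adjust to your checkout).
with open("path/to/annotations.json") as f:
    coco = json.load(f)

image_ids = [img["id"] for img in coco["images"]]
random.seed(0)
random.shuffle(image_ids)

val_fraction = 0.2                        # 80-20; I also tried 0.1 and 0.3
n_val = int(val_fraction * len(image_ids))
val_ids = set(image_ids[:n_val])
train_ids = set(image_ids[n_val:])
```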

BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 35
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 512
IMAGE_META_SIZE 32
IMAGE_MIN_DIM 512
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [512 512 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME car_parts
NUM_CLASSES 20
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 500
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 100
WEIGHT_DECAY 0.0001

UPDATE: It was fluctuating because my dataset already contains a background annotation. When creating my custom Dataset this produced two background classes, which caused problems during training. Now my training no longer fluctuates. (The fix is sketched below.)
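In case someone hits the same thing: utils.Dataset already registers background as class 0, so a dataset whose annotation file also ships a background category ends up with it twice. My fix was roughly this (a sketch; the class/method names and the COCO-style category dicts are assumptions about my own loader, not repo code):

```python
from mrcnn import utils

class CarPartsDataset(utils.Dataset):
    def load_car_parts(self, categories):
        """categories: COCO-style list of {'id': ..., 'name': ...} dicts (assumed format)."""
        # utils.Dataset already registers ("", 0, "BG") as class 0,
        # so skip any background category coming from the annotation file
        # to avoid registering the background a second time.
        for cat in categories:
            if cat["name"].lower() in ("background", "bg"):
                continue
            self.add_class("car_parts", cat["id"], cat["name"])
```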

jjavv commented 1 year ago

I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on?

Network 1: [loss plot, accuracy plot - images not reproduced]

Network 2: [loss plot, accuracy plot - images not reproduced]

Savant-HO commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on?
>
> Network 1 / Network 2 loss and accuracy plots

Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!

jjavv commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!

I couldn't come to any conclusion.

Savant-HO commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.

If you solve it one day, please tell me! Thank you!

MahBadran93 commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.

You need to solve the data imbalance problem; it can be the main reason for the bad results. Make sure each class is represented roughly equally across the train, val and test splits. You can also try augmentation. (A quick way to check the per-split class distribution is sketched below.)
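Something like this can be used to check the distribution (a sketch; dataset_train and dataset_val are assumed to be prepared matterport utils.Dataset instances, and load_mask is the standard per-image mask loader of that class):

```python
from collections import Counter

def class_distribution(dataset):
    """Count how often each class id occurs across all images of a Dataset."""
    counts = Counter()
    for image_id in dataset.image_ids:
        _, class_ids = dataset.load_mask(image_id)   # standard Dataset API
        counts.update(class_ids.tolist())
    return counts

# The rare classes (e.g. the one with a single instance) should not end up
# exclusively in one split.
print(class_distribution(dataset_train))
print(class_distribution(dataset_val))
```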

jjavv commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!
>
> I couldn't come to any conclusion.
>
> You need to solve the data imbalance problem; it can be the main reason for the bad results. Make sure each class is represented roughly equally across the train, val and test splits. You can also try augmentation.

I tried data augmentation, but a pretrained AlexNet still skipped some classes in the classification report and the accuracy is very low. On MNIST it gave 98%, but on my ECG dataset it was 48%, and the classification report shows a precision/recall of 0 for a few classes.

2022kaishi commented 1 year ago

> I got these results. My dataset has an imbalance problem, but is that the only reason, or is something else going on? (Network 1 / Network 2 loss and accuracy plots)
>
> Hello, I am facing this problem too. Can you tell me how to solve it? Thanks!

Hi guys, I'm facing the same issue. Here is my advice:

  1. Check your dataset. The same preprocessing (augmentation, rescaling, etc.) should be applied to all splits.
  2. Use the callback API in Keras to keep reducing the learning rate. This method helped me out (see the sketch below).
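For example, something along these lines (a sketch; it assumes your copy of model.train accepts the custom_callbacks argument, which recent versions of this repo do, and model/config/dataset objects are placeholders):

```python
from keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever the validation loss stops improving.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.5,           # multiply the learning rate by 0.5
    patience=3,           # after 3 epochs without improvement
    min_lr=1e-6,
    verbose=1,
)

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=100,
            layers='heads',
            custom_callbacks=[reduce_lr])
```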

I hope it was helpful.