matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Loss increase when lr lowered #2548

Open sohinimallick opened 3 years ago

sohinimallick commented 3 years ago

I am trying to train the model with a schedule where the learning rate is divided by 10 every few epochs: for example, the heads for 10 epochs at lr, layers '4+' for the next 10 epochs at lr/10, and all layers for another 10 epochs at lr/100 (30 epochs total). However, the loss increases after every training stage rather than continuing to go down, i.e. if the first stage finishes at a loss of 1.6, the next stage starts at a loss of 3.4. Does anyone know the reason behind this?

config.LEARNING_RATE = 0.001

Training schedule:

```python
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE/10,
            epochs=10, augmentation=augmentation, layers='heads')
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE/100,
            epochs=20, augmentation=augmentation, layers='4+')
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE/1000,
            epochs=30, augmentation=augmentation, layers='all')
```
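Not an official fix, just a sketch: if your copy of model.py supports the `custom_callbacks` argument to `model.train()` (present in current master), you can pin the learning rate with a Keras `LearningRateScheduler` so every stage starts exactly on the intended schedule. The `step_decay` helper below is hypothetical:

```python
from keras.callbacks import LearningRateScheduler

# Hypothetical helper: divide the base LR by 10 for each completed 10-epoch
# stage. model.train() passes the absolute epoch index through to Keras,
# so the schedule stays consistent across the staged train() calls.
def step_decay(epoch):
    return config.LEARNING_RATE / (10 ** (1 + epoch // 10))

lr_callback = LearningRateScheduler(step_decay)

# Same staged plan as above; the callback re-asserts the intended LR at the
# start of every epoch, so stage boundaries cannot introduce an LR mismatch.
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE / 10,
            epochs=10, augmentation=augmentation, layers='heads',
            custom_callbacks=[lr_callback])
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE / 100,
            epochs=20, augmentation=augmentation, layers='4+',
            custom_callbacks=[lr_callback])
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE / 1000,
            epochs=30, augmentation=augmentation, layers='all',
            custom_callbacks=[lr_callback])
```

Note this does not change the optimizer reset between stages; it only rules out learning-rate bookkeeping as the cause.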

Configurations:

```python
class LIDARConfig(Config):
    """Configuration for training on the LIDAR dataset.
    Derives from the base Config class and overrides values specific
    to the LIDAR dataset.
    """
    # Give the configuration a recognizable name
    NAME = "LIDAR_Celtic"

    # Train on 1 GPU with 1 image per GPU. Batch size is 1 (GPUs * images/GPU).
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1  # background + 1 shape (Celtic)

    # Set the limits of the small side and the large side;
    # together they determine the image shape.
    #IMAGE_MIN_DIM = 128
    #IMAGE_MAX_DIM = 128
    IMAGE_MAX_DIM = 512
    IMAGE_MIN_DIM = 320

    #MASK_SHAPE = [56, 56]

    # Use smaller anchors because our images and objects are small
    #RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)  # anchor side in pixels
    #RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)
    RPN_TRAIN_ANCHORS_PER_IMAGE = 128

    # Reduce training ROIs per image because the images are small and have
    # few objects. Aim to allow ROI sampling to pick 33% positive ROIs.
    TRAIN_ROIS_PER_IMAGE = 100

    # Use a small epoch since the data is simple
    STEPS_PER_EPOCH = 500

    # Use small validation steps since the epoch is small
    VALIDATION_STEPS = 50

    # Maximum number of ground truth instances to use in one image
    MAX_GT_INSTANCES = 40
    # Max number of final detections
    DETECTION_MAX_INSTANCES = 40
    # Minimum probability value to accept a detected instance;
    # ROIs below this threshold are skipped
    DETECTION_MIN_CONFIDENCE = 0.8

    USE_MINI_MASK = False

    LOSS_WEIGHTS = {
        "rpn_class_loss": 1.,
        "rpn_bbox_loss": 1.,
        "mrcnn_class_loss": 1.,
        "mrcnn_bbox_loss": 1.,
        "mrcnn_mask_loss": 1.
    }
```
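A plausible (unconfirmed) contributor to the jump: every `model.train()` call re-compiles the model, which resets the SGD optimizer state (momentum) and re-adds the L2 weight-decay term over whatever weights are currently trainable. Going from 'heads' to '4+' to 'all' enlarges that set, so the reported total loss can step up at a stage boundary even if the task losses are unchanged. Below is a standalone sketch for measuring that term, abridged from `MaskRCNN.compile()` in this repo's model.py (exact form varies by version; `l2_reg_term` is a hypothetical helper name):

```python
import tensorflow as tf
import keras

# Hypothetical helper mirroring the regularization loss that
# MaskRCNN.compile() adds: an L2 penalty summed over *trainable* weights
# only (batch-norm gamma/beta excluded), scaled by config.WEIGHT_DECAY
# (default 0.0001 in config.py).
def l2_reg_term(keras_model, weight_decay=0.0001):
    reg_losses = [
        keras.regularizers.l2(weight_decay)(w) / tf.cast(tf.size(w), tf.float32)
        for w in keras_model.trainable_weights
        if 'gamma' not in w.name and 'beta' not in w.name]
    return tf.add_n(reg_losses)

# Evaluating this with only the heads trainable and again with all layers
# trainable shows how much of the stage-boundary jump is pure weight decay.
```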
TimNagle-McNaughton commented 3 years ago

It seems like whatever lr modifications you're doing are at fault. I would remove them and see if the problem persists; if it does, that would point to a data problem instead.
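For a concrete version of that control run (a sketch of the suggestion above, untested on this dataset):

```python
# Baseline control: one stage, constant learning rate, heads only.
# If the loss curve is smooth here but the jumps reappear with the staged
# schedule, the staging (not the data) is the likely culprit.
model.train(train_set, val_set,
            learning_rate=config.LEARNING_RATE,
            epochs=30,
            augmentation=augmentation,
            layers='heads')
```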

konstantin-frolov commented 2 years ago

I have the same issue after updating to TensorFlow 2.4. Each subsequent training run starts with an increase in train and val loss even though nothing else changed. Do you have a solution?