mvoelk / ssd_detectors

SSD-based object and text detection with Keras, SSD, DSOD, TextBoxes, SegLink, TextBoxes++, CRNN
MIT License

DSOD Low mAP #44

Open AdamCuellar opened 4 years ago

AdamCuellar commented 4 years ago

Not necessarily an issue, but the mAP I got from training DSOD512 on VOC 07+12 and testing on VOC 07 was quite low, approximately 0.13.

The only thing I really changed was using Adam instead of AdamAccumulate, because the latter throws an error on tf 2.0. I also used softmax.

Also, no metrics other than the loss itself show up during training.

import os
import time

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import multi_gpu_model

# repo-local modules (import paths assumed from mvoelk/ssd_detectors; adjust to your checkout)
from data_voc import GTUtility
from ssd_model import DSOD512
from ssd_utils import PriorUtil
from ssd_data import InputGenerator
from ssd_training import SSDLoss, SSDFocalLoss, Logger


def trainMultiGPU():
    # set up data sets
    gt_util_voc = GTUtility("data/VOC2012train/")
    gt_util_voc7 = GTUtility("data/VOC2007train/")
    gt_util_voc_val = GTUtility("data/VOC2012val/", validation=True)
    gt_util_voc7_val = GTUtility("data/VOC2007val/", validation=True)

    gt_util_train = GTUtility.merge(gt_util_voc, gt_util_voc7)
    gt_util_val = GTUtility.merge(gt_util_voc_val, gt_util_voc7_val)

    experiment = 'dsod300_voc12_7'
    batch_size = 16

    # class_weights = prior_util.compute_class_weights(gt_util_train)
    class_weights = np.array(
        [0.00007169, 1.20864663, 1.23607288, 0.81087541, 1.32018959, 1.65339534, 1.47852761, 0.45099343, 0.84154551,
         0.33765636, 1.41315118, 1.32907548, 0.63492811, 1.15680594, 1.18978997, 0.07548318, 0.91531396, 1.21262288,
         1.15910985, 1.49269817, 1.08304682])
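    # the tiny first weight (presumably the background class) keeps the
    # dominant background priors from swamping the weighted focal loss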

    # DSOD paper
    # batch size 128
    # 320k iterations
    # initial learning rate 0.1

    epochs = 1000
    initial_epoch = 0

    with tf.device("/cpu:0"):
        # set up DSOD 512
        model = DSOD512(num_classes=gt_util_train.num_classes, softmax=True)

    prior_util = PriorUtil(model)
    gen_train = InputGenerator(gt_util_train, prior_util, batch_size, model.image_size, augmentation=True)
    gen_val = InputGenerator(gt_util_val, prior_util, batch_size, model.image_size, augmentation=True)
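    # note: augmentation on the validation split is unusual; val_loss is
    # monitored by the callbacks below, so most setups would disable it here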

    # weight decay
    regularizer = keras.regularizers.l2(5e-4)  # None if disabled
    for l in model.layers:
        if l.__class__.__name__.startswith('Conv'):
            l.kernel_regularizer = regularizer
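    # caveat: assigning kernel_regularizer to already-built layers may silently
    # have no effect, since Keras usually collects regularizers at construction
    # time; the model may need to be rebuilt from its config for this to apply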

    checkdir = './checkpoints/' + time.strftime('%Y%m%d%H%M') + '_' + experiment
    if not os.path.exists(checkdir):
        os.makedirs(checkdir)

    optim = keras.optimizers.Adam(learning_rate=1e-3)  # 'lr' is a deprecated alias in tf.keras

    # loss = SSDLoss(alpha=1.0, neg_pos_ratio=3.0)
    loss = SSDFocalLoss(lambda_conf=1.0, class_weights=class_weights)

    model = multi_gpu_model(model, gpus=2)
    model.compile(optimizer=optim, loss=loss.compute, metrics=loss.metrics)

    # add some callbacks
    reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
    early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)

    history = model.fit(
        gen_train.generate(),
        steps_per_epoch=gen_train.num_batches,
        epochs=epochs,
        verbose=1,
        callbacks=[
            keras.callbacks.ModelCheckpoint(checkdir + '/weights.{epoch:03d}.h5', verbose=1, save_weights_only=True,
                                            save_best_only=True, period=3),
            Logger(checkdir),
            reduce_lr,
            early_stopping
        ],
        validation_data=gen_val.generate(),
        validation_steps=gen_val.num_batches,
        class_weight=None,
        workers=1,
        use_multiprocessing=False,
        initial_epoch=initial_epoch)
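
A minimal multi-GPU sketch without multi_gpu_model, which was deprecated in tf 2.x and later removed: tf.distribute.MirroredStrategy is the supported replacement. This assumes the same model, loss, and generators as in the script above and is a sketch, not a tested drop-in.

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # variables must be created and the model compiled inside the strategy scope
    model = DSOD512(num_classes=gt_util_train.num_classes, softmax=True)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss=loss.compute, metrics=loss.metrics)
# model.fit(...) is then called exactly as above, with no multi_gpu_model wrapper
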
mvoelk commented 4 years ago

I had convergence issues with small batch sizes and was forced to use AdamAccumulate. The initial learning rate of 0.1 and the batch size of 128 from the paper were already suspicious to me.
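
For reference, a minimal sketch of the gradient accumulation that AdamAccumulate provides, assuming tf 2 eager execution: sum gradients over several small batches (e.g. 8 x 16 to emulate the paper's batch size of 128), then apply one optimizer step. Here model, loss, and gen_train stand for the objects from the script above.

ACCUM_STEPS = 8  # 8 batches of 16 ~ effective batch size 128
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

for step, (x, y_true) in enumerate(gen_train.generate()):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        # scale so the accumulated sum matches one large-batch gradient
        batch_loss = loss.compute(y_true, y_pred) / ACCUM_STEPS
    grads = tape.gradient(batch_loss, model.trainable_variables)
    for a, g in zip(accum, grads):
        if g is not None:
            a.assign_add(g)
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        for a in accum:
            a.assign(tf.zeros_like(a))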

The missing metrics are a known issue. They are more or less a hack and do not work with tf.keras, and probably not with multi-GPU training either. I did not have the time to fix training on tf 2.
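
A hedged sketch of a metric that tf.keras will actually display: it has to be a plain function of (y_true, y_pred) passed via compile(metrics=[...]). The slicing below is illustrative only; the real tensor layout depends on the ground-truth encoding in this repo.

def mean_max_conf(y_true, y_pred):
    # assumption: class scores follow the 4 box offsets in the last axis
    conf = y_pred[..., 4:]
    return tf.reduce_mean(tf.reduce_max(conf, axis=-1))

model.compile(optimizer=optim, loss=loss.compute, metrics=[mean_max_conf])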

See also #14 and #25.