plemeri / InSPyReNet

Official PyTorch implementation of Revisiting Image Pyramid Structure for High Resolution Salient Object Detection (ACCV 2022)
MIT License

Unable to reproduce models and Increasing validation loss #34

Closed malinamanolache closed 1 year ago

malinamanolache commented 1 year ago

Hello and thank you for the great work.

While working with this project I came across a few problems and I hope you could give me some suggestions.

1. Unable to reproduce models

First, I tried reproducing one of the LR+HR trainings, InSPyReNet_SwinB_HU (HRSOD-TR and UHRSD-TR), but I do not obtain the same results. I gathered the results in the following table:

| Dataset | Model | Sm | mae | adpEm | maxEm | avgEm | adpFm | maxFm | avgFm | wFm | mBA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DUTS-TE | yours | 0.939 | 0.0221 | 0.931 | 0.9657 | 0.951 | 0.865 | 0.936 | 0.908 | 0.901 | 0.735 |
| DUTS-TE | mine | 0.882 | 0.0396 | 0.897 | 0.909 | 0.889 | 0.799 | 0.847 | 0.8185 | 0.799 | 0.6437 |
| HRSOD-TE | yours | 0.9565 | 0.0173 | 0.9527 | 0.9746 | 0.9641 | 0.9090 | 0.9564 | 0.933 | 0.9234 | 0.7714 |
| HRSOD-TE | mine | 0.9136 | 0.0322 | 0.9023 | 0.9370 | 0.9199 | 0.815 | 0.8934 | 0.8579 | 0.8304 | 0.6412 |
| UHRSD-TE | yours | 0.9528 | 0.02038 | 0.9223 | 0.9708 | 0.9617 | 0.9029 | 0.9576 | 0.9431 | 0.9331 | 0.7897 |
| UHRSD-TE | mine | 0.9202 | 0.0332 | 0.9133 | 0.9477 | 0.9316 | 0.8615 | 0.9179 | 0.8967 | 0.8713 | 0.6621 |

Although the metrics are fairly close, the predictions from the model I trained are far inferior in quality to those of the provided model. I also tried training the PlusUltraHR model and experienced the same thing. Why could this happen? Why can I not reproduce the model?
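For reference, here is a minimal sketch of how the `mae` column is typically computed for saliency maps, assuming `preds` and `gts` are lists of grayscale maps in [0, 1]; this is the generic definition, not the repository's evaluation code:

```python
import numpy as np

def mean_absolute_error(preds, gts):
    """Average per-image MAE between predicted saliency maps and ground-truth masks."""
    scores = [np.abs(p.astype(np.float64) - g.astype(np.float64)).mean()
              for p, g in zip(preds, gts)]
    return float(np.mean(scores))
```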

2. Increasing loss during validation

Additionally, I added validation to the training script in order to monitor the model's performance during training:


```python
for epoch in epoch_iter:

    # ---- training ----
    if args.local_rank <= 0 and args.verbose is True:
        step_iter = tqdm.tqdm(enumerate(train_loader, start=1), desc='Iter', total=len(
            train_loader), position=1, leave=False, bar_format='{desc:<5.5}{percentage:3.0f}%|{bar:40}{r_bar}')
        if args.device_num > 1 and train_sampler is not None:
            train_sampler.set_epoch(epoch)
    else:
        step_iter = enumerate(train_loader, start=1)

    train_loss = []

    for i, sample in step_iter:
        optimizer.zero_grad()
        if opt.Train.Optimizer.mixed_precision is True and scaler is not None:
            with autocast():
                sample = to_cuda(sample)
                out = model(sample)

            scaler.scale(out['loss']).backward()
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
        else:
            sample = to_cuda(sample)
            out = model(sample)
            out['loss'].backward()
            optimizer.step()
            scheduler.step()

        if args.local_rank <= 0 and args.verbose is True:
            step_iter.set_postfix({'loss': out['loss'].item()})

        train_loss.append(out['loss'].item())

    average_loss = np.mean(train_loss)

    # ---- validation (the part I added) ----
    step_iter_test = enumerate(test_loader, start=1)

    # model.eval()
    df = df.append({'epoch': epoch, 'scope': 'train', 'set': 'all', 'metric': 'loss', 'value': average_loss}, ignore_index=True)
    writer.add_scalar('Train/loss', average_loss, epoch)

    # per-test-set accumulators
    mse_sum = {}
    loss_sum = {}
    count = {}

    with torch.no_grad():
        for i, sample in step_iter_test:
            sample = to_cuda(sample)  # adds 50 MB to GPU memory
            set_name = sample['set'][0]
            out = model(sample)
            loss = out['loss'].detach().cpu().numpy()
            pred = to_numpy(out['pred'], sample['shape'])
            gt = to_numpy(out['gt'], sample['shape'])
            mse = compute_mse(predict=pred, alpha=gt)

            if set_name not in mse_sum:
                mse_sum[set_name] = 0.0
                loss_sum[set_name] = 0.0
                count[set_name] = 0
            mse_sum[set_name] += mse
            loss_sum[set_name] += loss
            count[set_name] += 1

        for set_name in mse_sum:
            mean_mse = mse_sum[set_name] / count[set_name]
            mean_loss = loss_sum[set_name] / count[set_name]
            df = df.append({'epoch': epoch, 'scope': 'valid', 'set': set_name, 'metric': 'mse', 'value': mean_mse}, ignore_index=True)
            df = df.append({'epoch': epoch, 'scope': 'valid', 'set': set_name, 'metric': 'loss', 'value': mean_loss}, ignore_index=True)
            writer.add_scalar('Valid/' + set_name + '/mse', mean_mse, epoch)
            writer.add_scalar('Valid/' + set_name + '/loss', mean_loss, epoch)

    # persist the per-epoch metrics and switch back to training mode
    df_path = os.path.join(opt.Train.Checkpoint.checkpoint_dir, f'{log_id}.json')
    df.to_json(df_path, orient='records')

    model.train()
```

InSPyReNet_SwinB_HU training & validation

For the InSPyReNet_SwinB_HU training, the validation set I used is DUTS-TE. The training loss decreases steadily, but the validation loss starts increasing after some epochs:

[Figures: training loss and validation loss curves for InSPyReNet_SwinB_HU]

My assumptions were the following: either the model is overfitting, or the data distribution of the training sets is too different from that of the test set.
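As a side note, a common way to react to a rising validation loss is to keep the checkpoint with the lowest validation loss seen so far; here is a minimal sketch, reusing `model` and `mean_loss` from the snippet above (this is not part of the original training script):

```python
import torch

best_val_loss = float('inf')

# at the end of each epoch, after mean_loss has been computed for the validation set
if mean_loss < best_val_loss:
    best_val_loss = mean_loss
    torch.save(model.state_dict(), 'best_val.pth')  # keep the best-so-far weights
```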

Overfitting Check

To check whether overfitting is the problem, I trained an LR model (using the Plus_Ultra_LR config) on 43K samples ('MSRA-10K', 'HRSOD-TR', 'HRSOD-TE', 'ECSSD', 'HKU-IS', 'PASCAL-S', 'DAVIS', 'UHRSD-TR', 'UHRSD-TE', 'FSS-1000', 'DIS5K') and validated it after each epoch on 300 images from DUTS-TE. I chose an LR model and only a subset of DUTS-TE for faster training. The validation loss still increases:

[Figure: validation loss curve for the Plus_Ultra_LR model on the DUTS-TE subset]

Overfitting typically occurs when the training set is small or the model is too complex. After this experiment with 43K images, I doubt that overfitting is responsible for the increase in validation loss.
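For context, validating on a fixed subset (e.g. the 300 DUTS-TE images mentioned above) can be done with `torch.utils.data.Subset`; a minimal sketch, where `duts_te` stands in for the full DUTS-TE dataset object:

```python
from torch.utils.data import DataLoader, Subset

# keep only the first 300 samples so per-epoch validation stays cheap
val_subset = Subset(duts_te, indices=range(300))
test_loader = DataLoader(val_subset, batch_size=1, shuffle=False, num_workers=4)
```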

Data Distribution Check

I was also thinking that the difference in data distribution between the training sets might be too large, and that the model struggles to find an optimum that accommodates all cases, making it hard to generalize. To test this, I decided to train an LR model on UHRSD2K-TR only for 150 epochs and validate it on several test sets:

[Figures: validation loss curves on UHRSD2K-TE, HRSOD-TE, and PASCAL-S]

I was expecting the loss to decrease for UHRSD2K-TE and to increase for HRSOD-TE and PASCAL-S, but the validation loss increases for all of them. Alongside the experiments above, I have trained InSPyReNet with different configurations and datasets, and in every case the validation loss increases. What could the problem be? Why is the validation loss always increasing?
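For reference, the per-dataset bookkeeping in the validation loop above relies on each sample carrying a `set` entry. Below is a minimal sketch of one way to combine several test sets while keeping that label, assuming each underlying dataset returns a dict per sample; the wrapper and the dataset variables are hypothetical, not the repository's data pipeline:

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class NamedSet(Dataset):
    """Wrap a dataset so every sample dict also carries its dataset name."""
    def __init__(self, dataset, name):
        self.dataset = dataset
        self.name = name

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        sample['set'] = self.name  # read back as sample['set'][0] after collation
        return sample

# hypothetical dataset objects for the test sets used above
test_sets = {'UHRSD2K-TE': uhrsd_te, 'HRSOD-TE': hrsod_te, 'PASCAL-S': pascal_s}
test_loader = DataLoader(
    ConcatDataset([NamedSet(ds, name) for name, ds in test_sets.items()]),
    batch_size=1, shuffle=False)
```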

plemeri commented 1 year ago

Did you solve the problem? For your information, we trained our model for the high-resolution configuration on 8 Titan RTX GPUs with 1 sample per device, i.e., a total batch size of 8. We also tried a single GPU and did not run into any trouble like the above.
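For clarity, with one sample per device the effective batch size equals the number of processes; here is a minimal, generic PyTorch sketch of that setup (not the repository's exact launcher, and `train_dataset` is a placeholder):

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# assumes the process group was already initialized, e.g. via torchrun
world_size = dist.get_world_size()                 # 8 in the authors' setup
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=1, sampler=train_sampler)
effective_batch_size = 1 * world_size              # 8 samples per optimizer step
```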

Also, I keep noticing that some people have training issues that I cannot reproduce myself. I cannot be sure, but the problem might come from the CUDA version, the PyTorch version, accidentally using half precision, a dataset problem, or something else.
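A quick way to rule out the environment-related causes listed here is to print the relevant versions and the default precision before training; this is a small sketch using only standard PyTorch calls:

```python
import torch

print('torch version :', torch.__version__)
print('CUDA (build)  :', torch.version.cuda)
print('cuDNN         :', torch.backends.cudnn.version())
print('default dtype :', torch.get_default_dtype())  # should be torch.float32
# half precision in the training snippet above is controlled by the config flag
# opt.Train.Optimizer.mixed_precision
```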

I would like to ask you not to modify a single script in our repository and to try to reproduce the basic model first, to make sure that our code works fine on your machine. The configuration you are looking for is InSPyReNet_SwinB.yaml. Please do not change any code and just use DUTS-TR for training. Then evaluate on the other benchmarks, including UHRSD-TE. If you can reproduce our results, then something you changed is causing the problem.
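To double-check that an unmodified configuration is being used, one option is to load the YAML and inspect its Train section before launching; a minimal sketch with PyYAML, where the file path is an assumption and the keys shown are the ones that appear in the training snippet above:

```python
import yaml

# adjust the path to wherever the unmodified config lives in your checkout
with open('configs/InSPyReNet_SwinB.yaml') as f:
    cfg = yaml.safe_load(f)

print('mixed_precision:', cfg['Train']['Optimizer'].get('mixed_precision'))
print('checkpoint_dir :', cfg['Train']['Checkpoint'].get('checkpoint_dir'))
```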

I would also like to mention that I did not train our model many times to produce the best result for the paper. I trained it once, tested it on various GPU servers, and verified that our method consistently produced almost identical results. So if you solve the problem above, I can guarantee that you will get the results you expected. Don't give up on your project, and I'll help as much as I can.