meetps / pytorch-semseg

Semantic Segmentation Architectures Implemented in PyTorch
https://meetshah.dev/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html
MIT License

Loss increases a lot when training on pspnet #167

Open sanweiliti opened 5 years ago

sanweiliti commented 5 years ago

Hi, I got a validation mIoU of ~78% on Cityscapes with the pspnet model, but when I try to fine-tune this model on the Cityscapes training set, after a single back-propagation step the training loss and validation loss become extremely high and the mIoU drops a lot. Does anyone know why? Does this have anything to do with the batch normalization?
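
In case it is a BatchNorm issue: would freezing the BatchNorm layers during fine-tuning, roughly like the sketch below, be a reasonable workaround? This is only my own guess, assuming the model's normalization layers are plain nn.BatchNorm2d modules (the helper name freeze_batchnorm is mine, not from this repo).

import torch.nn as nn

def freeze_batchnorm(model):
    # Put every BatchNorm layer in eval mode so its running mean/var
    # are not overwritten by the tiny fine-tuning batches, and stop
    # gradients on its affine parameters.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

# Note: model.train() re-enables BN updates, so this would have to be
# called again after every model.train() in the training loop.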

adam9500370 commented 5 years ago

Could you share your training settings (i.e., optimizer, learning rate, image size, ... in config file)?

sanweiliti commented 5 years ago

Hi, I'm using the following config:

model:
    arch: pspnet
    version: cityscapes

data:
    dataset: cityscapes
    train_split: train
    val_split: val
    test_split: test
    img_rows: 257
    img_cols: 513
    img_norm: False
    path: ./datasets/cityscapes
    version: pascal # pascal mean for pspNet

training:
    train_iters: 1000
    batch_size: 2
    val_interval: 5
    n_workers: 2
    print_interval: 1
    optimizer:
        name: 'adam'
        lr: 1.0e-4
    loss:
        name: 'multi_scale_cross_entropy'
        size_average: True
    lr_schedule:
    resume:  

And I load the trained weights via the load_pretrained_model() function, which works fine for validation. Due to the resolution, this config only reaches ~61% mIoU on validation, but after training for one iteration the mIoU drops to 40% and never gets back to 61%. I just used the normal training procedure in train.py, nothing special.
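
For reference, one sanity check I am considering (my own sketch, not code from train.py; the names model, trainloader, loss_fn and device are assumed to match the usual train.py setup, and freeze_batchnorm is the helper sketched above): take a single optimizer step with the BatchNorm layers kept in eval mode and a much smaller learning rate, then re-run validation to see whether the mIoU still collapses.

import torch

model.train()
freeze_batchnorm(model)  # keep BN running statistics fixed during the step

# update only the parameters that still require gradients, with a low lr
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1.0e-5
)

images, labels = next(iter(trainloader))
images, labels = images.to(device), labels.to(device)

optimizer.zero_grad()
loss = loss_fn(input=model(images), target=labels)  # multi_scale_cross_entropy
loss.backward()
optimizer.step()

model.eval()
# ... then run the usual validation loop and compare the mIoU with the pre-step value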

fabvio commented 5 years ago

It also happens to me when I try to train with resized images. +1

Edit: I'm also training with batch size 8, so I suppose there is a problem with the training procedure.

ghost commented 5 years ago

Did you solve this problem? I am facing the same problem.

fabvio commented 5 years ago

No, I had to change the training routine. I suppose that some of the strategies implemented in this repo simply don't work with huge architectures like pspnet.