meetps / pytorch-semseg

Semantic Segmentation Architectures Implemented in PyTorch
https://meetshah.dev/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html
MIT License

icnet error (icnet returns a tuple but the loss does not handle it) #161

Open lucasjinreal opened 6 years ago

lucasjinreal commented 6 years ago

Hi, icnet returns a tuple during training, but when calculating the loss the code calls .size() directly on that tuple and fails with this error:

Traceback (most recent call last):
  File "train.py", line 230, in <module>
    train(cfg, writer, logger)
  File "train.py", line 132, in train
    loss = loss_fn(input=outputs, target=labels)
  File "pytorch-semseg/ptsemseg/loss/loss.py", line 10, in cross_entropy2d
    n, c, h, w = input.size()
AttributeError: 'tuple' object has no attribute 'size'
adam9500370 commented 6 years ago

Hi, @jinfagang . You can set multi_scale_cross_entropy loss function in config file.

loss:
    name: 'multi_scale_cross_entropy'

And change 'exponent' tensor type to float and set the corresponding device (in ptsemseg/loss/loss.py#L36):

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if target.is_cuda else 'cpu')
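
(For reference, a rough sketch of how such a multi-scale loss can combine the tuple of outputs; names and the exact resizing/reduction handling are illustrative and may differ from the repo's ptsemseg/loss/loss.py.)

import torch
import torch.nn.functional as F

def cross_entropy2d(input, target, weight=None, size_average=True):
    # Per-pixel cross entropy on a single (N, C, H, W) prediction.
    n, c, h, w = input.size()
    nt, ht, wt = target.size()
    if h != ht or w != wt:
        # Bring the prediction to the label resolution before computing the loss.
        input = F.interpolate(input, size=(ht, wt), mode="bilinear", align_corners=True)
    return F.cross_entropy(input, target, weight=weight,
                           reduction="mean" if size_average else "sum")

def multi_scale_cross_entropy2d(input, target, weight=None, size_average=True, scale=0.4):
    # During training ICNet returns a tuple of predictions at several scales.
    if not isinstance(input, tuple):
        return cross_entropy2d(input, target, weight, size_average)
    n_inp = len(input)
    # Down-weight the auxiliary (lower-resolution) branches geometrically.
    scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to(
        'cuda' if target.is_cuda else 'cpu')
    loss = 0.0
    for i, inp in enumerate(input):
        loss = loss + scale_weight[i] * cross_entropy2d(inp, target, weight, size_average)
    return loss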
lucasjinreal commented 6 years ago

@adam9500370 Hi, I was finally able to train icnet. However, after 10k more iterations, the mean IoU does not look right at all:

27it [00:02, 16.36it/s]WARN: resizing labels yielded fewer classes
500it [00:26, 29.90it/s]
Overall Acc:     0.4196301378853902
Mean Acc :   0.15644619030428067
FreqW Acc :      0.31476421378091346
Mean IoU :   0.09229576247351066
Iter [194050/300000]  Loss: 816839.8750  Time/Image: 0.1126
Iter [194100/300000]  Loss: 548733.3750  Time/Image: 0.1123
Iter [194150/300000]  Loss: 898010.5625  Time/Image: 0.1130
Iter [194200/300000]  Loss: 646011.3125  Time/Image: 0.1125
Iter [194250/300000]  Loss: 968136.6250  Time/Image: 0.1122
Iter [194300/300000]  Loss: 655537.1875  Time/Image: 0.1125
Iter [194350/300000]  Loss: 673936.6250  Time/Image: 0.1127
Iter [194400/300000]  Loss: 556652.3750  Time/Image: 0.1128
Iter [194450/300000]  Loss: 751962.5000  Time/Image: 0.1116
Iter [194500/300000]  Loss: 685939.0625  Time/Image: 0.1128
Iter [194550/300000]  Loss: 653181.4375  Time/Image: 0.1128
Iter [194600/300000]  Loss: 596467.0625  Time/Image: 0.1117
Iter [194650/300000]  Loss: 947831.4375  Time/Image: 0.1131
Iter [194700/300000]  Loss: 603308.4375  Time/Image: 0.1123
Iter [194750/300000]  Loss: 470650.3438  Time/Image: 0.1125
Iter [194800/300000]  Loss: 461287.7500  Time/Image: 0.1140
Iter [194850/300000]  Loss: 803597.2500  Time/Image: 0.1140
Iter [194900/300000]  Loss: 580953.6875  Time/Image: 0.1157
Iter [194950/300000]  Loss: 472815.9375  Time/Image: 0.1151
Iter [195000/300000]  Loss: 620432.0625  Time/Image: 0.1165
26it [00:02, 16.84it/s]WARN: resizing labels yielded fewer classes
500it [00:26, 18.86it/s]
Overall Acc:     0.43595608380925455
Mean Acc :   0.14414920306903656
FreqW Acc :      0.30780209512001516
Mean IoU :   0.09285922375025128
Iter [195050/300000]  Loss: 584194.6875  Time/Image: 0.1131
Iter [195100/300000]  Loss: 579036.9375  Time/Image: 0.1129
Iter [195150/300000]  Loss: 761244.0000  Time/Image: 0.1124
Iter [195200/300000]  Loss: 789020.6875  Time/Image: 0.1127
Iter [195250/300000]  Loss: 497891.0312  Time/Image: 0.1132
Iter [195300/300000]  Loss: 814943.5625  Time/Image: 0.1123
Iter [195350/300000]  Loss: 719462.1250  Time/Image: 0.1126
Iter [195400/300000]  Loss: 583933.4375  Time/Image: 0.1119
Iter [195450/300000]  Loss: 510635.5000  Time/Image: 0.1145
Iter [195500/300000]  Loss: 540089.3125  Time/Image: 0.1137
Iter [195550/300000]  Loss: 678339.6875  Time/Image: 0.1141
Iter [195600/300000]  Loss: 1116914.5000  Time/Image: 0.1133
Iter [195650/300000]  Loss: 574083.0625  Time/Image: 0.1158

The loss is far too large, and the mean IoU is clearly wrong. Any idea what is going on?

adam9500370 commented 6 years ago

Could you share your training settings (e.g., # of classes (dataset), optimizer, learning rate, image size, ...)?

lucasjinreal commented 6 years ago

@adam9500370 Of course.

model:
    arch: icnet
data:
    dataset: cityscapes
    train_split: train
    val_split: val
    # icnet should be 32*n+1
    img_rows: 513
    img_cols: 1025
    path: /media/jintain/sg/permanent/datasets/Cityscapes
training:
    train_iters: 300000
    batch_size: 1
    val_interval: 1000
    n_workers: 16
    print_interval: 50
    optimizer:
        name: 'sgd'
        lr: 1.0e-10
        weight_decay: 0.0005
        momentum: 0.99
    loss:
        name: 'multi_scale_cross_entropy'
        size_average: False
    lr_schedule:
#    resume: fcn8s_pascal_best_model.pkl
    resume: runs/icnet_cityscapes_best_model.pkl

Nothing else changed. I am training on Cityscapes using the default cityscapes dataloader.

adam9500370 commented 6 years ago

Due to size_average: False for the loss calculation, you may get a very large loss value (the summation of the cross entropy loss over all pixels of all images in each batch). I think you may need to set size_average: True to calculate the mean loss value instead.
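
(To illustrate the scale difference, a small sketch comparing the two settings; shapes and class count are illustrative.)

import torch
import torch.nn.functional as F

logits = torch.randn(1, 19, 256, 512)           # (N, C, H, W) predictions
labels = torch.randint(0, 19, (1, 256, 512))    # (N, H, W) class indices

# size_average: False -> sum over every pixel in the batch, easily in the 10^5 range
loss_sum = F.cross_entropy(logits, labels, reduction="sum")

# size_average: True -> mean over pixels, roughly ln(19) ~ 2.9 for random logits
loss_mean = F.cross_entropy(logits, labels, reduction="mean")

print(loss_sum.item(), loss_mean.item())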

In addition, if you train the model from scratch, you may need to try the following:

You can also download the converted Caffe pretrained Cityscapes models here, and set img_norm=False and version="pascal" arguments in the data_loader (due to the data preprocessing of the original Caffe implementation).
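
(A hedged sketch of passing those arguments when building the loader directly; only img_norm and version are taken from the comment above, the rest is illustrative, so check the actual loader signature in ptsemseg/loader.)

from ptsemseg.loader import get_loader

data_loader = get_loader("cityscapes")
t_loader = data_loader(
    "/path/to/Cityscapes",   # dataset root
    split="train",
    is_transform=True,
    img_size=(513, 1025),
    img_norm=False,          # keep raw [0, 255] pixel values, as the Caffe weights expect
    version="pascal",        # subtract the Caffe BGR mean instead of normalizing
)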

lucasjinreal commented 6 years ago

@adam9500370 Hi, I took your advice and retrained from scratch, but the mean IoU is still not normal. Here is the log:

Iter [2800/300000]  Loss: 1.7671  Time/Image: 0.1351
Iter [2850/300000]  Loss: 1.8565  Time/Image: 0.1378
Iter [2900/300000]  Loss: 1.8952  Time/Image: 0.1374
Iter [2950/300000]  Loss: 1.7559  Time/Image: 0.1380
Iter [3000/300000]  Loss: 1.7315  Time/Image: 0.1363
0it [00:00, ?it/s]WARN: resizing labels yielded fewer classes
63it [00:55,  3.46it/s]
Overall Acc:     0.7806173583871298
Mean Acc :   0.26045823400686646
FreqW Acc :      0.64662924844955
Mean IoU :   0.20318397657362453
Iter [3050/300000]  Loss: 1.6093  Time/Image: 0.1298
Iter [3100/300000]  Loss: 1.7549  Time/Image: 0.1368
Iter [3150/300000]  Loss: 1.6235  Time/Image: 0.1380
Iter [3200/300000]  Loss: 1.3351  Time/Image: 0.1375
Iter [3250/300000]  Loss: 1.4034  Time/Image: 0.1393
Iter [3300/300000]  Loss: 1.7972  Time/Image: 0.1369
WARN: resizing labels yielded fewer classes
Iter [3350/300000]  Loss: 1.6406  Time/Image: 0.1366
Iter [3400/300000]  Loss: 1.7513  Time/Image: 0.1395
WARN: resizing labels yielded fewer classes
Iter [3450/300000]  Loss: 1.6573  Time/Image: 0.1381
Iter [3500/300000]  Loss: 2.1634  Time/Image: 0.1379
Iter [3550/300000]  Loss: 1.4725  Time/Image: 0.1357
Iter [3600/300000]  Loss: 1.5244  Time/Image: 0.1386
Iter [3650/300000]  Loss: 1.4610  Time/Image: 0.1374
Iter [3700/300000]  Loss: 1.6305  Time/Image: 0.1372
Iter [3750/300000]  Loss: 1.5950  Time/Image: 0.1387
Iter [3800/300000]  Loss: 1.8183  Time/Image: 0.1326
Iter [3850/300000]  Loss: 1.9768  Time/Image: 0.1387
Iter [3900/300000]  Loss: 1.4756  Time/Image: 0.1380
WARN: resizing labels yielded fewer classes
Iter [3950/300000]  Loss: 1.3690  Time/Image: 0.1374
Iter [4000/300000]  Loss: 1.4399  Time/Image: 0.1379
0it [00:00, ?it/s]WARN: resizing labels yielded fewer classes
63it [00:55,  3.55it/s]
Overall Acc:     0.7558650777368152
Mean Acc :   0.2424623463158562
FreqW Acc :      0.620776533991615
Mean IoU :   0.18858147214744353

As you can see, after almost 4000 iterations the mean IoU is still around 0.18. Is that normal? I don't see any continuing improvement.

adam9500370 commented 6 years ago

Due to the high proportion of road-class pixels in the Cityscapes dataset, you may need to do class balancing and set higher loss weights for the rare classes (reference: https://github.com/Eromera/erfnet_pytorch/blob/09efaac1dc7829e3719552cbe1e63183368f916d/train/main.py#L88-L131). In addition, since the Cityscapes dataset has only ~3000 training samples, you may need to apply some data augmentation.
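
(A minimal sketch of the ERFNet-style weighting referenced above, w_k = 1 / ln(c + p_k); the label iterator, class count, and ignore index are placeholders for your own dataset statistics.)

import numpy as np
import torch

def compute_class_weights(label_iter, n_classes=19, c=1.02, ignore_index=250):
    # p_k: pixel frequency of class k over the training labels; rare classes get larger weights.
    counts = np.zeros(n_classes, dtype=np.float64)
    total = 0
    for label in label_iter:                 # each label: (H, W) array of class ids
        valid = label[label != ignore_index]
        counts += np.bincount(valid, minlength=n_classes)[:n_classes]
        total += valid.size
    freqs = counts / max(total, 1)
    weights = 1.0 / np.log(c + freqs)
    return torch.from_numpy(weights).float()

# Pass the result as the `weight` argument of the cross entropy loss, e.g.
# class_weights = compute_class_weights(train_labels).to(device)
# loss = F.cross_entropy(outputs, labels, weight=class_weights, ignore_index=250)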

lfdeep commented 6 years ago

Hi, @jinfagang . You can set multi_scale_cross_entropy loss function in config file.

loss:
    name: 'multi_scale_cross_entropy'

And change 'exponent' tensor type to float and set the corresponding device (in ptsemseg/loss/loss.py#L36):

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if input.is_cuda else 'cpu')

When I run pspnet and modify the loss to

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if input.is_cuda else 'cpu')

the following error occurs: AttributeError: 'tuple' object has no attribute 'is_cuda'. I don't know how to solve it.

adam9500370 commented 6 years ago

Replace

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if input.is_cuda else 'cpu')

with

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if target.is_cuda else 'cpu')

to avoid handling a different input type in different phases (during training the model output is a tuple of predictions, while target is always a single tensor).

lfdeep commented 6 years ago

Replace

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if input.is_cuda else 'cpu')

with

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if target.is_cuda else 'cpu')

to avoid handling a different input type in different phases.

Thank you very much! But my result is unusual:

Iter [450/300000]  Loss: 0.5713  Time/Image: 2.4058
Iter [460/300000]  Loss: 2.1904  Time/Image: 2.2330
Iter [470/300000]  Loss: 3.7478  Time/Image: 2.2353
Iter [480/300000]  Loss: 1.8667  Time/Image: 2.2329
Iter [490/300000]  Loss: 2.2474  Time/Image: 2.2363
Iter [500/300000]  Loss: 1.5397  Time/Image: 2.2435
725it [16:00,  1.31s/it]
Iter 500 Loss on Val: 1.7601
Overall Acc:     0.735417399594
Mean Acc :   0.0471207022447
FreqW Acc :      0.550698099316
Mean IoU :   0.0352395812907

I set batch_size=2, lr=0.01, size_average: True, and I use the Pascal VOC + SBD datasets.

adam9500370 commented 6 years ago

Due to the high proportion of background-class pixels in the Pascal VOC dataset, if you train the model from scratch, it might tend to learn only the background class. Therefore, you may need to do class balancing to set higher loss weights for the rare classes, or set ignore_index=0 in F.cross_entropy to ignore the background class until the model has learned the other classes.
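
(A small sketch of both options, with dummy tensors standing in for the model output and the labels.)

import torch
import torch.nn.functional as F

outputs = torch.randn(2, 21, 256, 256)          # (N, C, H, W) logits, Pascal VOC: 21 classes
labels = torch.randint(0, 21, (2, 256, 256))    # (N, H, W) class indices

# Option 1: skip the dominant background class (index 0) entirely.
loss_no_bg = F.cross_entropy(outputs, labels, ignore_index=0)

# Option 2: keep the background but down-weight it relative to the 20 object classes.
class_weights = torch.ones(21)
class_weights[0] = 0.1    # illustrative value, tune from your label statistics
loss_weighted = F.cross_entropy(outputs, labels, weight=class_weights)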

You can also download the converted Caffe pretrained weights here, and set img_norm=False and version="pascal" arguments in the data_loader (due to the data preprocessing of the original Caffe implementation). Then use a larger batch size and a smaller learning rate to fine-tune the model on these datasets.

lfdeep commented 6 years ago

Due to the high proportion of background-class pixels in the Pascal VOC dataset, if you train the model from scratch, it might tend to learn only the background class. Therefore, you may need to do class balancing to set higher loss weights for the rare classes, or set ignore_index=0 in F.cross_entropy to ignore the background class until the model has learned the other classes.

You can also download the converted Caffe pretrained weights here, and set img_norm=False and version="pascal" arguments in the data_loader (due to the data preprocessing of the original Caffe implementation). Then use a larger batch size and a smaller learning rate to fine-tune the model on these datasets.

Thank you very much!

erichhhhho commented 5 years ago

@lfdeep Hi, I ran into a similar problem. I was wondering how you solved it. Thank you.

HareshKarnan commented 5 years ago

My network doesn't seem to learn even after 10000 training iterations; the mIoU is still at 0.20.