open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

When using the epoch-based training strategy, training stays stuck in the first epoch #2912

Open · Xie-Muxi-BK opened this issue 1 year ago

Xie-Muxi-BK commented 1 year ago

Config file:

_base_ = [
    '../_base_/models/pspnet_r50-d8.py', '../_base_/datasets/potsdam.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_80k.py'
]
crop_size = (512, 512)
data_preprocessor = dict(size=crop_size)
model = dict(
    data_preprocessor=data_preprocessor,
    pretrained='open-mmlab://resnet18_v1c',
    backbone=dict(depth=18),
    decode_head=dict(
        in_channels=512,
        channels=128,
        num_classes=6),
    auxiliary_head=dict(num_classes=6, in_channels=256, channels=64))

# training schedule: 10 epochs (overrides the 80k-iteration schedule from the base)
train_cfg = dict(
    _delete_=True,
    type='EpochBasedTrainLoop',
    max_epochs=10,
    val_interval=2)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
default_hooks = dict(
    # timer=dict(type='EpochTimerHook'),
    logger=dict(type='LoggerHook', interval=10, log_metric_by_epoch=True),
    # param_scheduler=dict(type='ParamSchedulerHook',convert_to_iter_based=False),
    checkpoint=dict(type='CheckpointHook', by_epoch=True, interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='SegVisualizationHook'))
log_processor = dict(by_epoch=True)

What is the cause?

mm-assistant[bot] commented 1 year ago

We recommend using English or English & Chinese for issues so that we could have broader discussion.

ff98li commented 1 year ago

Not 100% sure if it will resolve your issue, but it looks similar to one I encountered. Based on the base config you used in ../_base_/datasets/potsdam.py: https://github.com/open-mmlab/mmsegmentation/blob/b600f7cb26829afa2c785af41755391626fbb446/configs/_base_/datasets/potsdam.py#L42-L52

Maybe you can try https://github.com/open-mmlab/mmsegmentation/issues/2777#issuecomment-1508144760 to see if there's any luck?
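
For context, the linked lines configure train_dataloader with sampler=dict(type='InfiniteSampler', shuffle=True). An InfiniteSampler yields indices indefinitely, so an EpochBasedTrainLoop never reaches the end of its first epoch, which matches the reported behaviour. A minimal sketch of the override suggested in the linked comment; batch_size here is an assumed placeholder for whatever the base config sets:

# Replace the base config's InfiniteSampler with a finite DefaultSampler
# so that EpochBasedTrainLoop can detect the end of an epoch.
train_dataloader = dict(
    batch_size=2,  # assumption: keep the value from your base config
    sampler=dict(type='DefaultSampler', shuffle=True))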

Xie-Muxi-BK commented 1 year ago

Not 100% sure if it will resolve your issue, but it looks similar to one I encountered. Based on the base config you used in ../_base_/datasets/potsdam.py:

https://github.com/open-mmlab/mmsegmentation/blob/b600f7cb26829afa2c785af41755391626fbb446/configs/_base_/datasets/potsdam.py#L42-L52

Maybe you can try #2777 (comment) to see if there's any luck?

Thank you very much for your advice. It works properly after that modification in this config, but when I use another config, I get a different exception: ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])

I used the simple PSPNet + Potsdam setup to make my problem easier to describe.

Before raising this issue, I had already tried changing

    sampler=dict(type='InfiniteSampler', shuffle=True),

to

    sampler=dict(type='DefaultSampler', shuffle=True),

But I didn't change any other sampler parameters in the given config (PSPNet + Potsdam).

For example, with the following config:

_base_ = [
    '../_base_/models/upernet_convnext.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
crop_size = (512, 512)
data_preprocessor = dict(size=crop_size)
checkpoint_file = 'https://download.openmmlab.com/mmclassification/v0/convnext/downstream/convnext-tiny_3rdparty_32xb128-noema_in1k_20220301-795e9634.pth'  # noqa
model = dict(
    data_preprocessor=data_preprocessor,
    backbone=dict(
        type='mmcls.ConvNeXt',
        arch='tiny',
        out_indices=[0, 1, 2, 3],
        drop_path_rate=0.4,
        layer_scale_init_value=1.0,
        gap_before_final_norm=False,
        init_cfg=dict(
            type='Pretrained', checkpoint=checkpoint_file,
            prefix='backbone.')),
    decode_head=dict(
        in_channels=[96, 192, 384, 768],
        num_classes=150,
    ),
    auxiliary_head=dict(in_channels=384, num_classes=150),
    test_cfg=dict(mode='slide', crop_size=crop_size, stride=(341, 341)),
)

optim_wrapper = dict(
    _delete_=True,
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.0001, betas=(0.9, 0.999), weight_decay=0.05),
    paramwise_cfg={
        'decay_rate': 0.9,
        'decay_type': 'stage_wise',
        'num_layers': 6
    },
    constructor='LearningRateDecayOptimizerConstructor',
    loss_scale='dynamic')

param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
    dict(
        type='PolyLR',
        power=1.0,
        begin=1500,
        end=160000,
        eta_min=0.0,
        by_epoch=False,
    )
]

# By default, models are trained on 8 GPUs with 2 images per GPU
train_dataloader = dict(batch_size=2, sampler=None)
val_dataloader = dict(batch_size=1, sampler=None)
test_dataloader = val_dataloader

train_cfg = dict(
    _delete_=True,
    type='EpochBasedTrainLoop',
    max_epochs=10,
    val_interval=2)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

default_hooks = dict(
    # timer=dict(type='EpochTimerHook'),
    logger=dict(type='LoggerHook', interval=10, log_metric_by_epoch=True),
    # param_scheduler=dict(type='ParamSchedulerHook',convert_to_iter_based=False),
    checkpoint=dict(type='CheckpointHook', by_epoch=True, interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='SegVisualizationHook'))

log_processor = dict(by_epoch=True)

it raises:

  File "/root/anaconda3/envs/XMX/lib/python3.9/site-packages/torch/nn/functional.py", line 2416, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])
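
This ValueError comes from PyTorch's batch-norm input check: a BatchNorm layer in training mode received a single sample whose feature map is 1×1 (torch.Size([1, 512, 1, 1])), so there are no other values to compute batch statistics from. With a finite sampler and drop_last left at False, the last batch of an epoch can contain exactly one image, which is enough to trigger it; the 1×1 spatial size is consistent with a global-pooling branch in the decode head. A minimal standalone repro, assuming nothing beyond PyTorch itself:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(512)       # same channel count as in the traceback
bn.train()                     # the check only fires in training mode
x = torch.randn(1, 512, 1, 1)  # batch of 1 with a 1x1 feature map
bn(x)  # ValueError: Expected more than 1 value per channel when training ...
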
yanrihong commented 1 year ago

I have the same problem. Have you solved it?

Jin455 commented 1 year ago

Hello, I had the same problem and solved it by adding the following to my config:

train_dataloader = dict(
    sampler=dict(type='DefaultSampler', shuffle=True),
    drop_last=True)

But I am still not sure why this solves the problem.
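
A plausible reason this works: drop_last=True is forwarded to PyTorch's DataLoader and discards the final incomplete batch of each epoch, so a BatchNorm layer never receives a batch of size 1. Combining the two workarounds from this thread gives a sketch like the following; batch_size is an assumed placeholder:

# Sketch combining the two workarounds discussed above:
# a finite sampler so EpochBasedTrainLoop can finish an epoch, and
# drop_last=True so no trailing size-1 batch reaches BatchNorm.
train_dataloader = dict(
    batch_size=2,  # assumption: keep your base config's value
    sampler=dict(type='DefaultSampler', shuffle=True),
    drop_last=True)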