open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0

Visualization of validation metric does not plot curve #683

Open ZwwWayne opened 2 years ago

ZwwWayne commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

In IterBasedTrainLoop, when a model is trained and evaluated at regular intervals, the curve of the metric value (e.g., mIoU) against the iteration number is not plotted. This is because the step number is not saved correctly: in the saved vis_data/scalars.json, the step value of every evaluation record is 0, whereas it should increase as training progresses. An incorrect log looks like this:

{"lr": 0.02, "data_time": xxx, "loss": 0.01, "step": 50}
{"lr": 0.02, "data_time": xxx, "loss": 0.01, "step": 100}
{"mIoU": xxx, "step": 0}
{"lr": 0.02, "data_time": xxx, "loss": 0.01, "step": 150}
{"lr": 0.02, "data_time": xxx, "loss": 0.01, "step": 200}
{"mIoU": xxx, "step": 0}

Reproduction

  1. What command or script did you run?

Simply training an iter-based model in MMSegmentation should reproduce the error; the issue is visible in TensorBoard.

  2. Did you make any modifications to the code or config? Did you understand what you have modified?

A typical config looks like

# training schedule for 20e
max_iters = 116000
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=max_iters, val_interval=5800)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0,
        end=6000),
    dict(
        type='MultiStepLR',
        begin=0,
        end=max_iters,
        by_epoch=False,
        milestones=[69600, 92800],
        gamma=0.1),
]

default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(
        type='CheckpointHook',
        interval=2900,
        by_epoch=False,
        max_keep_ckpts=4,
        save_best='auto',
        rule='greater'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='mmdet.DetVisualizationHook', score_thr=0.25))

env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl', port=29898),
)
randomness = dict(seed=0)
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend')
]
visualizer = dict(
    type='mmdet.DetLocalVisualizer',
    vis_backends=vis_backends,
    name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=False)

log_level = 'INFO'
load_from = None
resume = False

  3. What dataset did you use? N/A

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

HAOCHENYE commented 2 years ago

It seems this is caused by log_metric_by_epoch not being set to False.
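
As a rough illustration of why that flag matters, here is a simplified toy model; it is not MMEngine's actual code. It assumes that the logger records validation metrics against the epoch counter when log_metric_by_epoch is True and against the iteration counter otherwise, and that the epoch counter stays at 0 during an iteration-based run, as the step values in the log suggest. The function name `pick_step` is hypothetical.

```python
def pick_step(epoch, iteration, log_metric_by_epoch=True):
    """Toy model: choose the x-axis value used when a validation metric is logged."""
    return epoch if log_metric_by_epoch else iteration


# Assumed scenario: the epoch counter never advances in an iteration-based run.
epoch = 0
for iteration in (100, 200):
    by_epoch_step = pick_step(epoch, iteration, log_metric_by_epoch=True)
    by_iter_step = pick_step(epoch, iteration, log_metric_by_epoch=False)
    print(f"iter={iteration}: by_epoch step={by_epoch_step}, by_iter step={by_iter_step}")
```

Under these assumptions every validation point lands on step 0 when the flag is left at True, so no curve can be drawn.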

ZwwWayne commented 2 years ago

Do we have any doc that discusses switching between epoch-based and iter-based running?

ZwwWayne commented 2 years ago

It seems this is caused by log_metric_by_epoch not being set to False.

Which part of the config should also be modified? default_hooks.logger?

HAOCHENYE commented 2 years ago

It seems this is caused by log_metric_by_epoch not being set to False.

Which part of the config should also be modified? default_hooks.logger?

Yes. We currently do not have a doc explaining how to switch between epoch-based and iter-based training 😢; it should be added during the documentation refactoring.
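
A sketch of the suggested config change, assuming the rest of the config from the report stays as it is. Only the logger entry is shown here; the other default_hooks entries (timer, checkpoint, and so on) are omitted for brevity and would remain unchanged.

```python
# Keep the log processor iteration-based, as in the reported config.
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=False)

default_hooks = dict(
    logger=dict(
        type='LoggerHook',
        interval=50,
        # Record validation metrics against the iteration number
        # instead of the epoch number.
        log_metric_by_epoch=False),
)
```

With this setting, the evaluation records in vis_data/scalars.json would be expected to carry the current iteration as their step rather than 0.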