open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Second learning rate value #1729

Closed. Firyuza closed this issue 4 years ago.

Firyuza commented 4 years ago

Hi!

Please can you explain why there is a second learning-rate value? The first is 'lr', which starts at the value set in the config and changes according to the policy defined there. But when I debug the code and look at the console, I also see a 'learning_rate' variable with a different value. In runner.py I see this line of code: outputs['log_vars']['learning_rate'] = self.current_lr() but current_lr() returns the 'lr' value.
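
(For reference, a minimal sketch of what current_lr() does in mmcv's Runner of that era, paraphrased rather than the verbatim source: it simply reads the instantaneous lr of every optimizer param group.)

import torch

# Paraphrased sketch of mmcv Runner.current_lr() (assumption: 2019-era API),
# written as a free function so it is self-contained.
def current_lr(optimizer: torch.optim.Optimizer):
    # one entry per param group; with a single group this is just [lr]
    return [group['lr'] for group in optimizer.param_groups]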

Thanks!

ZwwWayne commented 4 years ago

What is the value of 'lr', and what do you see in the console? Please be more specific, or provide the log and tell us what you expected to see but did not.

Firyuza commented 4 years ago

Hi @ZwwWayne!

In the console I see the 'lr' value I set in the config file, optimizer = dict(type='SGD', lr=0.0009, ...). I also set the policy to 'fixed' and warmup to 'constant', so the console shows 0.0009 for 'lr'. But there is another value in the console, 'learning_rate', which equals 0.0450.

Also, I use one GPU, the batch size is 4, and workers_per_gpu equals 2.
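
(For what it's worth, a rough sketch of how I understand mmcv's 'constant' warmup; the helper below is paraphrased for illustration, not the exact source:)

# Paraphrased sketch of 'constant' warmup in mmcv's LrUpdaterHook:
# for the first warmup_iters iterations, lr is held at base_lr * warmup_ratio.
def constant_warmup_lr(regular_lr, warmup_ratio):
    return [lr * warmup_ratio for lr in regular_lr]

print(constant_warmup_lr([0.0009], 1.0))  # [0.0009] -- warmup_ratio=1.0 changes nothing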

Thanks!

Firyuza commented 4 years ago

Hi @ZwwWayne! Any updates?

Thanks!

ZwwWayne commented 4 years ago

Hi @Firyuza, can you show what your log looks like? In my own log I only see 'lr', as in the screenshot below.

[screenshot: training log showing only the 'lr' field]

If you print it for debugging, can you tell me in which line of which file you set it?

Firyuza commented 4 years ago

Hi @ZwwWayne, this is from my console (copied as text):

Epoch [8][50/47991] lr: 0.00043, eta: 17 days, 13:10:57, time: 3.940, data_time: 0.408, memory: 8919, loss_rpn_cls: 0.0032, loss_rpn_bbox: 0.0103, loss_cls: 0.0993, acc: 96.5088, loss_bbox: 0.0338, loss_mask: 0.1790, metric_learning_loss: 1.2020, loss: 1.5274, learning_rate: 0.0215

The question is why learning_rate is 0.0215 rather than equal to lr, 0.00043.

In runner.py, in the train method, I added this:

if 'log_vars' in outputs:
    outputs['log_vars']['learning_rate'] = self.current_lr()
    self.log_buffer.update(outputs['log_vars'],
                           outputs['num_samples'])

I use this code to log the learning rate to TensorBoard, and there it logs the correct number, i.e. 0.00043 (the value of 'lr' in the console). I just wondered how I got learning_rate: 0.0215 in the console; maybe I missed something?
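
(For completeness, a minimal sketch of how one could log the raw lr to TensorBoard without routing it through log_buffer; the hook below is hypothetical, not part of mmdetection, and assumes the mmcv Hook API of that era:)

# Hypothetical hook (illustration only): write the instantaneous lr straight
# to TensorBoard, bypassing log_buffer and its averaging.
from mmcv.runner import Hook
from torch.utils.tensorboard import SummaryWriter

class RawLrTensorboardHook(Hook):
    def __init__(self, log_dir=None, interval=50):
        self.writer = SummaryWriter(log_dir)
        self.interval = interval

    def after_train_iter(self, runner):
        if self.every_n_iters(runner, self.interval):
            # current_lr() returns one lr per optimizer param group
            self.writer.add_scalar('learning_rate',
                                   runner.current_lr()[0], runner.iter)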

Thanks!

hellock commented 4 years ago

Hi @Firyuza, could you provide your complete config file? Apart from the runner, did you make any other modifications?

Firyuza commented 4 years ago

Hi @hellock! Yes, I've made other modifications, but none of them touch the learning-rate update logic. Thanks for the feedback!

from datetime import datetime

# fp16 settings
fp16 = dict(loss_scale=512.)

# model settings
model = dict(
    type='ClothingRetrieval',
    pretrained=None,
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_scales=[8],
        anchor_ratios=[0.5, 1.0, 2.0],
        anchor_strides=[4, 8, 16, 32, 64],
        target_means=[.0, .0, .0, .0],
        target_stds=[1.0, 1.0, 1.0, 1.0],
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0 / 9.0, loss_weight=1.0)),
    bbox_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=7, sample_num=2),
        out_channels=256,
        featmap_strides=[4, 8, 16, 32]),
    bbox_head=dict(
        type='SharedFCBBoxHead',
        num_fcs=2,
        in_channels=256,
        fc_out_channels=1024,
        roi_feat_size=7,
        num_classes=14,
        target_means=[0., 0., 0., 0.],
        target_stds=[0.1, 0.1, 0.2, 0.2],
        reg_class_agnostic=False,
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
    mask_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=14, sample_num=2),
        out_channels=256,
        featmap_strides=[4, 8, 16, 32]),
    mask_head=dict(
        type='FCNMaskHead',
        num_convs=4,
        in_channels=256,
        conv_out_channels=256,
        num_classes=14,
        loss_mask=dict(
            type='CrossEntropyLoss', use_mask=True, loss_weight=1.0)))

images_per_gpu = 4
# model training and testing settings
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=0,
        pos_weight=-1,
        debug=False),
    rpn_proposal=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=2000,
        max_num=2000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        mask_size=28,
        pos_weight=-1,
        debug=False),
    nrof_products_per_batch=images_per_gpu // 2)
test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.05,
        nms=dict(type='nms', iou_thr=0.5),
        max_per_img=100,
        mask_thr_binary=0.5))
# dataset settings
dataset_type = 'DeepFashion2'
data_root = ''
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, with_pair_id=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks', 'gt_pair_id', 'gt_pair_category']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, with_pair_id=True),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks', 'gt_pair_id']),
        ])
]

data = dict(
    imgs_per_gpu=images_per_gpu,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        sampler_type='GroupSampler',
        ann_file='',
        img_prefix=data_root,
        pipeline=train_pipeline,
        nrof_products_per_batch=images_per_gpu // 2),
    val=dict(
        type=dataset_type,
        ann_file='',
        img_prefix=data_root,
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file='',
        img_prefix=data_root,
        pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=0.0009, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='exp',
    warmup='constant',
    warmup_iters=500,
    warmup_ratio=1.0,
    gamma=0.9,
    step=[8, 11])
checkpoint_config = dict(interval=1)
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
evaluation = dict(interval=1)
# runtime settings
total_epochs = 200
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = 'My_working_dir'
load_from = None
resume_from = None
workflow = [('train', 1)]
hellock commented 4 years ago

The following modification you made to the code is the problem:

if 'log_vars' in outputs:
    outputs['log_vars']['learning_rate'] = self.current_lr()
    self.log_buffer.update(outputs['log_vars'],
                           outputs['num_samples'])

log_buffer applies a moving average to the values it records, so the 'learning_rate' printed in the console is an aggregate over the logging window, not the instantaneous value you assigned.
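
To make that concrete, here is a simplified sketch of LogBuffer's behaviour (paraphrased from mmcv, not the verbatim source): update() appends each value together with its sample count, and average(n) reports a weighted mean over the last n records, which is what the text logger prints at each log interval (50 iterations in your config).

from collections import OrderedDict
import numpy as np

# Simplified sketch of mmcv's LogBuffer (paraphrased).
class LogBuffer:
    def __init__(self):
        self.val_history = OrderedDict()
        self.n_history = OrderedDict()
        self.output = OrderedDict()

    def update(self, vars, count=1):
        # record each value along with its sample count
        for key, var in vars.items():
            self.val_history.setdefault(key, []).append(var)
            self.n_history.setdefault(key, []).append(count)

    def average(self, n=0):
        # weighted mean over the last n records (n == 0: whole history)
        for key in self.val_history:
            values = np.array(self.val_history[key][-n:])
            nums = np.array(self.n_history[key][-n:])
            self.output[key] = np.sum(values * nums) / np.sum(nums)

So any value pushed through log_buffer.update() is reported as a windowed, num_samples-weighted average rather than as its latest value. (One hedged observation: current_lr() returns a list, not a scalar, and 0.0215 is exactly 50 x 0.00043 with a log interval of 50, so feeding the list into the buffer plausibly breaks the weighted averaging; treat that as a guess, not a confirmed diagnosis.)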

Firyuza commented 4 years ago

Oh, OK. Thanks for the feedback!