w1oves / Rein

[CVPR 2024] Official implementation of "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation"
https://zxwei.site/rein
GNU General Public License v3.0

About EVA02's config file #54

Closed: xiaoxia0722 closed this issue 2 weeks ago

xiaoxia0722 commented 1 month ago

Hello, the results are not ideal when we try to reproduce Rein+EVA02. Could you provide the EVA02 configuration file? The EVA02 configuration we used (the full dumped config) is as follows:
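# Dumped mmengine config: dataset roots and pipelines (BDD100K, Cityscapes, GTA5, Mapillary) come first, followed by the model, optimizer, schedule, and dataloaders.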

backbone_norm_cfg = dict(eps=1e-06, requires_grad=True, type='LN')
bdd_crop_size = (
    512,
    512,
)
bdd_root = '/workspace/Rein/data/bdd100k/'
bdd_test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        1280,
        720,
    ), type='Resize'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
bdd_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(scale=(
        1280,
        720,
    ), type='Resize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
bdd_type = 'CityscapesDataset'
cityscapes_crop_size = (
    512,
    512,
)
cityscapes_root = '/workspace/Rein/data/cityscapes/'
cityscapes_test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        1024,
        512,
    ), type='Resize'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
cityscapes_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(scale=(
        1024,
        512,
    ), type='Resize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
cityscapes_type = 'CityscapesDataset'
crop_size = (
    512,
    512,
)
default_hooks = dict(
    checkpoint=dict(
        by_epoch=False, interval=6000, max_keep_ckpts=3,
        type='CheckpointHook'),
    logger=dict(interval=50, log_metric_by_epoch=False, type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(type='SegVisualizationHook'))
default_scope = 'mmseg'
embed_multi = dict(decay_mult=0.0, lr_mult=1.0)
env_cfg = dict(
    cudnn_benchmark=True,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
gta_crop_size = (
    512,
    512,
)
gta_root = '/workspace/Rein/data/gta/'
gta_test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        1280,
        720,
    ), type='Resize'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
gta_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(scale=(
        1280,
        720,
    ), type='Resize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
gta_type = 'CityscapesDataset'
launcher = 'none'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
mapillary_crop_size = (
    512,
    512,
)
mapillary_root = '/workspace/Rein/data/mapillary/'
mapillary_test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        1024,
        512,
    ), type='Resize'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
mapillary_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(scale=(
        1024,
        512,
    ), type='Resize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
mapillary_type = 'CityscapesDataset'
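# Model: EncoderDecoder with a ReinsEVA2 backbone (EVA-02-L: 24 layers, embed_dim 1024) adapted via LoRAReins, and a ReinMask2FormerHead decode head (19 classes, 100 queries); slide inference with a 512x512 window and stride 341.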
model = dict(
    backbone=dict(
        depth=24,
        drop_path_rate=0.2,
        embed_dim=1024,
        img_size=512,
        in_chans=3,
        init_values=None,
        intp_freq=True,
        mlp_ratio=2.6666666666666665,
        naiveswiglu=True,
        norm_layer=dict(eps=1e-06, requires_grad=True, type='LN'),
        num_heads=16,
        out_indices=[
            7,
            11,
            15,
            23,
        ],
        patch_size=16,
        pretrained='/workspace/Rein/checkpoints/eva02_L_converted.pth',
        pt_hw_seq_len=16,
        qkv_bias=True,
        reins_config=dict(
            embed_dims=1024,
            link_token_to_query=True,
            lora_dim=16,
            num_layers=24,
            patch_size=16,
            token_length=100,
            type='LoRAReins'),
        rope=True,
        subln=True,
        type='ReinsEVA2',
        use_abs_pos_emb=True,
        use_checkpoint=False,
        use_rel_pos_bias=False,
        use_shared_rel_pos_bias=False,
        xattn=True),
    data_preprocessor=dict(
        bgr_to_rgb=True,
        mean=[
            123.675,
            116.28,
            103.53,
        ],
        pad_val=0,
        seg_pad_val=255,
        size=(
            512,
            512,
        ),
        std=[
            58.395,
            57.12,
            57.375,
        ],
        type='SegDataPreProcessor'),
    decode_head=dict(
        align_corners=False,
        enforce_decoder_input_project=False,
        feat_channels=256,
        in_channels=[
            1024,
            1024,
            1024,
            1024,
        ],
        loss_cls=dict(
            class_weight=[
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                0.1,
            ],
            loss_weight=2.0,
            reduction='mean',
            type='mmdet.CrossEntropyLoss',
            use_sigmoid=False),
        loss_dice=dict(
            activate=True,
            eps=1.0,
            loss_weight=5.0,
            naive_dice=True,
            reduction='mean',
            type='mmdet.DiceLoss',
            use_sigmoid=True),
        loss_mask=dict(
            loss_weight=5.0,
            reduction='mean',
            type='mmdet.CrossEntropyLoss',
            use_sigmoid=True),
        num_classes=19,
        num_queries=100,
        num_transformer_feat_level=3,
        out_channels=256,
        pixel_decoder=dict(
            act_cfg=dict(type='ReLU'),
            encoder=dict(
                init_cfg=None,
                layer_cfg=dict(
                    ffn_cfg=dict(
                        act_cfg=dict(inplace=True, type='ReLU'),
                        embed_dims=256,
                        feedforward_channels=1024,
                        ffn_drop=0.0,
                        num_fcs=2),
                    self_attn_cfg=dict(
                        batch_first=True,
                        dropout=0.0,
                        embed_dims=256,
                        im2col_step=64,
                        init_cfg=None,
                        norm_cfg=None,
                        num_heads=8,
                        num_levels=3,
                        num_points=4)),
                num_layers=6),
            init_cfg=None,
            norm_cfg=dict(num_groups=32, type='GN'),
            num_outs=3,
            positional_encoding=dict(normalize=True, num_feats=128),
            type='mmdet.MSDeformAttnPixelDecoder'),
        positional_encoding=dict(normalize=True, num_feats=128),
        replace_query_feat=True,
        strides=[
            4,
            8,
            16,
            32,
        ],
        train_cfg=dict(
            assigner=dict(
                match_costs=[
                    dict(type='mmdet.ClassificationCost', weight=2.0),
                    dict(
                        type='mmdet.CrossEntropyLossCost',
                        use_sigmoid=True,
                        weight=5.0),
                    dict(
                        eps=1.0,
                        pred_act=True,
                        type='mmdet.DiceCost',
                        weight=5.0),
                ],
                type='mmdet.HungarianAssigner'),
            importance_sample_ratio=0.75,
            num_points=12544,
            oversample_ratio=3.0,
            sampler=dict(type='mmdet.MaskPseudoSampler')),
        transformer_decoder=dict(
            init_cfg=None,
            layer_cfg=dict(
                cross_attn_cfg=dict(
                    attn_drop=0.0,
                    batch_first=True,
                    dropout_layer=None,
                    embed_dims=256,
                    num_heads=8,
                    proj_drop=0.0),
                ffn_cfg=dict(
                    act_cfg=dict(inplace=True, type='ReLU'),
                    add_identity=True,
                    dropout_layer=None,
                    embed_dims=256,
                    feedforward_channels=2048,
                    ffn_drop=0.0,
                    num_fcs=2),
                self_attn_cfg=dict(
                    attn_drop=0.0,
                    batch_first=True,
                    dropout_layer=None,
                    embed_dims=256,
                    num_heads=8,
                    proj_drop=0.0)),
            num_layers=9,
            return_intermediate=True),
        type='ReinMask2FormerHead'),
    test_cfg=dict(crop_size=(
        512,
        512,
    ), mode='slide', stride=(
        341,
        341,
    )),
    train_cfg=dict(),
    type='EncoderDecoder')
norm_cfg = dict(requires_grad=True, type='SyncBN')
num_classes = 19
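# Optimization: AdamW (lr 1e-4, weight decay 0.05) through PEFTOptimWrapperConstructor; norms, learnable tokens, query/level embeddings, and Rein scales get no weight decay.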
optim_wrapper = dict(
    constructor='PEFTOptimWrapperConstructor',
    optimizer=dict(
        betas=(
            0.9,
            0.999,
        ),
        eps=1e-08,
        lr=0.0001,
        type='AdamW',
        weight_decay=0.05),
    paramwise_cfg=dict(
        custom_keys=dict({
            'learnable_tokens': dict(decay_mult=0.0, lr_mult=1.0),
            'level_embed': dict(decay_mult=0.0, lr_mult=1.0),
            'norm': dict(decay_mult=0.0),
            'query_embed': dict(decay_mult=0.0, lr_mult=1.0),
            'reins.scale': dict(decay_mult=0.0, lr_mult=1.0)
        }),
        norm_decay_mult=0.0))
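# Schedule: polynomial LR decay (power 0.9, eta_min 0) over the 60k training iterations.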
param_scheduler = [
    dict(
        begin=0,
        by_epoch=False,
        end=60000,
        eta_min=0,
        power=0.9,
        type='PolyLR'),
]
randomness = dict(seed=78)
resume = False
test_cfg = dict(type='TestLoop')
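# Evaluation: Cityscapes, BDD100K, and Mapillary validation splits (ConcatDataset), scored per dataset with DGIoUMetric (mIoU).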
test_dataloader = dict(
    batch_size=1,
    dataset=dict(
        datasets=[
            dict(
                data_prefix=dict(
                    img_path='leftImg8bit/val', seg_map_path='gtFine/val'),
                data_root='/workspace/Rein/data/cityscapes/',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1024,
                        512,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                type='CityscapesDataset'),
            dict(
                data_prefix=dict(
                    img_path='images/10k/val',
                    seg_map_path='labels/sem_seg/masks/val'),
                data_root='/workspace/Rein/data/bdd100k/',
                img_suffix='.jpg',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1280,
                        720,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                seg_map_suffix='.png',
                type='CityscapesDataset'),
            dict(
                data_prefix=dict(
                    img_path='half/val_img', seg_map_path='half/val_label'),
                data_root='/workspace/Rein/data/mapillary/',
                img_suffix='.jpg',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1024,
                        512,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                seg_map_suffix='.png',
                type='CityscapesDataset'),
        ],
        type='ConcatDataset'),
    num_workers=4,
    persistent_workers=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = dict(
    dataset_keys=[
        'citys',
        'map',
        'bdd',
    ],
    iou_metrics=[
        'mIoU',
    ],
    type='DGIoUMetric')
train_bdd = dict(
    data_prefix=dict(
        img_path='images/10k/train',
        seg_map_path='labels/sem_seg/masks/train'),
    data_root='/workspace/Rein/data/bdd100k/',
    img_suffix='.jpg',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(scale=(
            1280,
            720,
        ), type='Resize'),
        dict(cat_max_ratio=0.75, crop_size=(
            512,
            512,
        ), type='RandomCrop'),
        dict(prob=0.5, type='RandomFlip'),
        dict(type='PhotoMetricDistortion'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='.png',
    type='CityscapesDataset')
train_cfg = dict(
    max_iters=60000, type='IterBasedTrainLoop', val_interval=10000)
train_cityscapes = dict(
    data_prefix=dict(
        img_path='leftImg8bit/train', seg_map_path='gtFine/train'),
    data_root='/workspace/Rein/data/cityscapes/',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(scale=(
            1024,
            512,
        ), type='Resize'),
        dict(cat_max_ratio=0.75, crop_size=(
            512,
            512,
        ), type='RandomCrop'),
        dict(prob=0.5, type='RandomFlip'),
        dict(type='PhotoMetricDistortion'),
        dict(type='PackSegInputs'),
    ],
    type='CityscapesDataset')
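# Training source: GTA5 only, batch size 2, with RandomChoiceResize (short edge 256-1024), 512x512 RandomCrop, random flip, and photometric distortion.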
train_dataloader = dict(
    batch_size=2,
    dataset=dict(
        data_prefix=dict(img_path='images', seg_map_path='labels'),
        data_root='/workspace/Rein/data/gta/',
        img_suffix='.png',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(
                max_size=2048,
                resize_type='ResizeShortestEdge',
                scales=[
                    256,
                    307,
                    358,
                    409,
                    460,
                    512,
                    563,
                    614,
                    665,
                    716,
                    768,
                    819,
                    870,
                    921,
                    972,
                    1024,
                ],
                type='RandomChoiceResize'),
            dict(
                cat_max_ratio=0.75, crop_size=(
                    512,
                    512,
                ), type='RandomCrop'),
            dict(prob=0.5, type='RandomFlip'),
            dict(type='PhotoMetricDistortion'),
            dict(type='PackSegInputs'),
        ],
        seg_map_suffix='_labelTrainIds.png',
        type='CityscapesDataset'),
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=True, type='InfiniteSampler'))
train_gta = dict(
    data_prefix=dict(img_path='images', seg_map_path='labels'),
    data_root='/workspace/Rein/data/gta/',
    img_suffix='.png',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(scale=(
            1280,
            720,
        ), type='Resize'),
        dict(cat_max_ratio=0.75, crop_size=(
            512,
            512,
        ), type='RandomCrop'),
        dict(prob=0.5, type='RandomFlip'),
        dict(type='PhotoMetricDistortion'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='_labelTrainIds.png',
    type='CityscapesDataset')
train_mapillary = dict(
    data_prefix=dict(
        img_path='training/images',
        seg_map_path='cityscapes_trainIdLabel/train/label'),
    data_root='/workspace/Rein/data/mapillary/',
    img_suffix='.jpg',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(scale=(
            1024,
            512,
        ), type='Resize'),
        dict(cat_max_ratio=0.75, crop_size=(
            512,
            512,
        ), type='RandomCrop'),
        dict(prob=0.5, type='RandomFlip'),
        dict(type='PhotoMetricDistortion'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='.png',
    type='CityscapesDataset')
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(
        max_size=2048,
        resize_type='ResizeShortestEdge',
        scales=[
            256,
            307,
            358,
            409,
            460,
            512,
            563,
            614,
            665,
            716,
            768,
            819,
            870,
            921,
            972,
            1024,
        ],
        type='RandomChoiceResize'),
    dict(cat_max_ratio=0.75, crop_size=(
        512,
        512,
    ), type='RandomCrop'),
    dict(prob=0.5, type='RandomFlip'),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs'),
]
tta_model = dict(type='SegTTAModel')
val_bdd = dict(
    data_prefix=dict(
        img_path='images/10k/val', seg_map_path='labels/sem_seg/masks/val'),
    data_root='/workspace/Rein/data/bdd100k/',
    img_suffix='.jpg',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(keep_ratio=True, scale=(
            1280,
            720,
        ), type='Resize'),
        dict(type='LoadAnnotations'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='.png',
    type='CityscapesDataset')
val_cfg = dict(type='ValLoop')
val_cityscapes = dict(
    data_prefix=dict(img_path='leftImg8bit/val', seg_map_path='gtFine/val'),
    data_root='/workspace/Rein/data/cityscapes/',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(keep_ratio=True, scale=(
            1024,
            512,
        ), type='Resize'),
        dict(type='LoadAnnotations'),
        dict(type='PackSegInputs'),
    ],
    type='CityscapesDataset')
val_dataloader = dict(
    batch_size=1,
    dataset=dict(
        datasets=[
            dict(
                data_prefix=dict(
                    img_path='leftImg8bit/val', seg_map_path='gtFine/val'),
                data_root='/workspace/Rein/data/cityscapes/',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1024,
                        512,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                type='CityscapesDataset'),
            dict(
                data_prefix=dict(
                    img_path='images/10k/val',
                    seg_map_path='labels/sem_seg/masks/val'),
                data_root='/workspace/Rein/data/bdd100k/',
                img_suffix='.jpg',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1280,
                        720,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                seg_map_suffix='.png',
                type='CityscapesDataset'),
            dict(
                data_prefix=dict(
                    img_path='half/val_img', seg_map_path='half/val_label'),
                data_root='/workspace/Rein/data/mapillary/',
                img_suffix='.jpg',
                pipeline=[
                    dict(type='LoadImageFromFile'),
                    dict(keep_ratio=True, scale=(
                        1024,
                        512,
                    ), type='Resize'),
                    dict(type='LoadAnnotations'),
                    dict(type='PackSegInputs'),
                ],
                seg_map_suffix='.png',
                type='CityscapesDataset'),
        ],
        type='ConcatDataset'),
    num_workers=4,
    persistent_workers=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = dict(
    dataset_keys=[
        'citys',
        'map',
        'bdd',
    ],
    iou_metrics=[
        'mIoU',
    ],
    type='DGIoUMetric')
val_gta = dict(
    data_prefix=dict(img_path='images', seg_map_path='labels'),
    data_root='/workspace/Rein/data/gta/',
    img_suffix='.png',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(keep_ratio=True, scale=(
            1280,
            720,
        ), type='Resize'),
        dict(type='LoadAnnotations'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='_labelTrainIds.png',
    type='CityscapesDataset')
val_mapillary = dict(
    data_prefix=dict(img_path='half/val_img', seg_map_path='half/val_label'),
    data_root='/workspace/Rein/data/mapillary/',
    img_suffix='.jpg',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(keep_ratio=True, scale=(
            1024,
            512,
        ), type='Resize'),
        dict(type='LoadAnnotations'),
        dict(type='PackSegInputs'),
    ],
    seg_map_suffix='.png',
    type='CityscapesDataset')
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend'),
]
visualizer = dict(
    name='visualizer',
    type='SegLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ])
work_dir = './work_dirs/rein_dinov2_mask2former_512x512_bs1x4'
tpy001 commented 1 month ago

@xiaoxia0722, I also tried to reproduce the Rein+EVA02 results using a configuration file similar to yours, but the resulting mIoU is only 44.35, which is quite abnormal. I suspect this is due to an incorrect EVA02 checkpoint or an issue with the code in rein/models/backbones/eva02.py. @xiaoxia0722, what results did you get when you reproduced it?
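One quick way to rule out a bad checkpoint is to load it and inspect the parameter names and shapes before training (a minimal sketch; the path and the wrapper keys are assumptions about the local setup):

import torch

# Assumed path to the converted EVA-02-L weights; adjust to your environment.
ckpt_path = 'checkpoints/eva02_L_converted.pth'

state = torch.load(ckpt_path, map_location='cpu')
# Converted checkpoints are sometimes wrapped under 'state_dict' or 'model'.
if isinstance(state, dict):
    state = state.get('state_dict', state.get('model', state))

print(f'{len(state)} tensors in the checkpoint')
# Spot-check a few parameter names and shapes (patch embedding, blocks, norms).
for key in list(state)[:10]:
    print(key, tuple(state[key].shape))

If the names here do not match what rein/models/backbones/eva02.py expects, many backbone weights may stay at their random initialization, which could explain an mIoU in the 40s.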

xiaoxia0722 commented 1 month ago

I added Rein to the EVA02 configuration file based on frozen_vfms, and the final result was a mean mIoU of 60.6167. @tpy001
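Concretely, the change amounts to switching the backbone type to ReinsEVA2 and adding a reins_config block on top of the frozen EVA02 config (a sketch in mmengine config syntax; the _base_ file name is a placeholder, and the reins_config values mirror the dump above):

# Placeholder base file name; in practice this is the frozen_vfms EVA02 config.
_base_ = ['./eva02_large_mask2former_512x512_frozen.py']

model = dict(
    backbone=dict(
        type='ReinsEVA2',
        reins_config=dict(
            type='LoRAReins',
            token_length=100,
            embed_dims=1024,
            num_layers=24,
            patch_size=16,
            link_token_to_query=True,
            lora_dim=16)))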

w1oves commented 2 weeks ago

I sincerely apologize for the delay in responding to your message, which was due to personal reasons. I have uploaded the config, log, and checkpoint for Rein+EVA02: https://github.com/w1oves/Rein/releases/tag/GTAV%2BEVA-L. Thank you for your support of Rein!