loss_rpn_cls=0, loss_bbox=0

kinredon commented 4 years ago

Hi,

I am training a Faster R-CNN model on custom data in COCO format, but loss_rpn_bbox and loss_bbox are both 0.0000. The detailed log is shown below:

2020-05-16 14:56:10,068 - mmdet - INFO - Start running, host: root@53acd6a3dfd7, work_dir: /dengjinhong/github/tdd_mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_20200516163338
2020-05-16 14:56:10,068 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2020-05-16 14:56:28,250 - mmdet - INFO - Epoch [1][50/534]      lr: 0.00198, eta: 0:38:23, time: 0.362, data_time: 0.013, memory: 3274, loss_rpn_cls: 0.2651, loss_rpn_bbox: 0.0000, loss_cls: 0.4077, acc: 91.1543, loss_bbox: 0.0000, loss: 0.6728
2020-05-16 14:56:42,271 - mmdet - INFO - Epoch [1][100/534]     lr: 0.00398, eta: 0:33:46, time: 0.280, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.4531, acc: 100.0000, loss_bbox: 0.0000, loss: 0.4531
2020-05-16 14:56:57,230 - mmdet - INFO - Epoch [1][150/534]     lr: 0.00597, eta: 0:32:44, time: 0.299, data_time: 0.007, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.1650, acc: 100.0000, loss_bbox: 0.0000, loss: 0.1650
2020-05-16 14:57:12,471 - mmdet - INFO - Epoch [1][200/534]     lr: 0.00797, eta: 0:32:14, time: 0.305, data_time: 0.007, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0266, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0266
2020-05-16 14:57:27,834 - mmdet - INFO - Epoch [1][250/534]     lr: 0.00997, eta: 0:31:53, time: 0.307, data_time: 0.007, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0078, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0078
2020-05-16 14:57:43,162 - mmdet - INFO - Epoch [1][300/534]     lr: 0.01197, eta: 0:31:33, time: 0.307, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0039, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0039
2020-05-16 14:57:58,621 - mmdet - INFO - Epoch [1][350/534]     lr: 0.01397, eta: 0:31:17, time: 0.309, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0023, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0023
2020-05-16 14:58:14,038 - mmdet - INFO - Epoch [1][400/534]     lr: 0.01596, eta: 0:31:00, time: 0.308, data_time: 0.007, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0016, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0016
2020-05-16 14:58:29,521 - mmdet - INFO - Epoch [1][450/534]     lr: 0.01796, eta: 0:30:45, time: 0.310, data_time: 0.007, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0011, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0011
2020-05-16 14:58:44,967 - mmdet - INFO - Epoch [1][500/534]     lr: 0.01996, eta: 0:30:29, time: 0.309, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0008, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0008
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 527/527, 14.5 task/s, elapsed: 36s, ETA:     0s
2020-05-16 14:59:34,089 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
2020-05-16 14:59:34,090 - mmdet - ERROR - The testing results of the whole dataset is empty.
2020-05-16 14:59:34,091 - mmdet - INFO - Epoch [1][534/534]     lr: 0.02000,
2020-05-16 14:59:49,768 - mmdet - INFO - Epoch [2][50/534]      lr: 0.02000, eta: 0:28:19, time: 0.312, data_time: 0.011, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0006, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0006
2020-05-16 15:00:05,338 - mmdet - INFO - Epoch [2][100/534]     lr: 0.02000, eta: 0:28:13, time: 0.311, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0005, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0005
2020-05-16 15:00:20,848 - mmdet - INFO - Epoch [2][150/534]     lr: 0.02000, eta: 0:28:06, time: 0.310, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0004, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0004
2020-05-16 15:00:36,366 - mmdet - INFO - Epoch [2][200/534]     lr: 0.02000, eta: 0:27:57, time: 0.310, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0003, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0004
2020-05-16 15:00:51,860 - mmdet - INFO - Epoch [2][250/534]     lr: 0.02000, eta: 0:27:47, time: 0.310, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0003, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0003
2020-05-16 15:01:07,319 - mmdet - INFO - Epoch [2][300/534]     lr: 0.02000, eta: 0:27:37, time: 0.309, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0003, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0003
2020-05-16 15:01:22,718 - mmdet - INFO - Epoch [2][350/534]     lr: 0.02000, eta: 0:27:25, time: 0.308, data_time: 0.006, memory: 3274, loss_rpn_cls: 0.0000, loss_rpn_bbox: 0.0000, loss_cls: 0.0002, acc: 100.0000, loss_bbox: 0.0000, loss: 0.0003

I only made minor changes: num_classes=5 (I have 5 classes) and the dataset paths.

The diff between the official config and mine:

56c56
<             num_classes=80,
---
>             num_classes=5,
125c125
< data_root='data/coco/'
---
> data_root='data/Industrial_quality/'
172,173c172,173
<         ann_file='data/coco/annotations/instances_train2017.json',
<         img_prefix='data/coco/train2017/',
---
>         ann_file='data/Industrial_quality/filtered_target_domain_train.json',
>         img_prefix='data/Industrial_quality/target_domain/all_img/',
194,195c194,195
<         ann_file='data/coco/annotations/instances_val2017.json',
<         img_prefix='data/coco/val2017/',
---
>         ann_file='data/Industrial_quality/filtered_target_domain_test.json',
>         img_prefix='data/Industrial_quality/target_domain/all_img/',
217,218c217,218
<         ann_file='data/coco/annotations/instances_val2017.json',
<         img_prefix='data/coco/val2017/',
---
>         ann_file='data/Industrial_quality/filtered_target_domain_test.json',
>         img_prefix='data/Industrial_quality/target_domain/all_img/',

Note that this config worked well in v1.x of mmdetection.

I have spent a lot of time on this issue but have not been able to solve it.

Could anyone help me? Big thanks!

oym050922021 commented 4 years ago

Hi, when I train FCOS on a dataset in COCO format, loss_bbox and loss_centerness are always 0. Does anyone know the reason? Thank you very much!

lji72 commented 4 years ago

Maybe MMDetection 2.0 is unstable; you could try 1.0 instead.

ZwwWayne commented 4 years ago

Hi @kinredon, if the number of classes is 5, you also need to set the classes key in your dataset config. MMDetection V2.0 supports training on a subset of a dataset by setting its classes.

kinredon commented 4 years ago

Hi @ZwwWayne, thanks. Sorry to bother you again, but what does the classes key in my dataset mean?

ecm200 commented 4 years ago

I believe it is the list of class names, which is specified in the configuration file. I am running a binary classification with a single object class at the moment, and I adapted the default configuration file to reflect my bespoke COCO formatted dataset.

I start by defining a list of class name strings at the top of the configuration file, called classes. For each dataset dictionary (train, val, test), I set the key 'classes' to that list:

_base_ = [
    '../configs/_base_/models/mask_rcnn_r50_fpn.py',
    '../configs/_base_/datasets/coco_instance.py',
    '../configs/_base_/schedules/schedule_1x.py', '../configs/_base_/default_runtime.py'
]
dataset_type = 'CocoDataset'
classes=['particle']
data_root = 'datasets/spherical_test_data_v1_5000_1500/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1296, 972), keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    #dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1296, 972),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='Normalize', **img_norm_cfg),
            #dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=5,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/train_coco.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline))
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
total_epochs = 30
gpus = 2
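
To find the names that belong in classes, it can help to read them straight from the annotation file. The sketch below is a minimal, standard-library-only check, assuming a COCO-style JSON with top-level `categories` and `annotations` lists. The function names `summarize_coco_categories` and `missing_classes` are made up for this illustration and are not part of MMDetection.

```python
import json
from collections import Counter


def summarize_coco_categories(ann_path):
    """Return (class_names, annotation_counts) for a COCO-style JSON file.

    class_names is a tuple ordered by category id; annotation_counts maps
    each class name to the number of annotations that use it.
    """
    with open(ann_path, "r", encoding="utf-8") as f:
        coco = json.load(f)

    # Map category id -> name, in ascending id order.
    categories = sorted(coco["categories"], key=lambda c: c["id"])
    id_to_name = {c["id"]: c["name"] for c in categories}

    # Count annotations per class; ids missing from `categories` are flagged.
    counts = Counter(
        id_to_name.get(a["category_id"], "<unknown id>")
        for a in coco.get("annotations", [])
    )
    return tuple(id_to_name.values()), dict(counts)


def missing_classes(config_classes, ann_path):
    """Return the names in config_classes that the annotation file lacks."""
    names, _ = summarize_coco_categories(ann_path)
    return [c for c in config_classes if c not in names]
```

Calling `missing_classes(classes, ann_file)` with the classes list and ann_file path from the config should return an empty list. Any name it returns does not appear in the file's categories, which is one plausible way to end up with no ground-truth boxes and zero box losses.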

kinredon commented 4 years ago

@ecm200 SSSSuper thanks! This works for me.

How did you find that out? Is there a document that mentions it?

ecm200 commented 4 years ago

There's a new document for bespoke training formats. https://github.com/open-mmlab/mmdetection/blob/master/docs/tutorials/new_dataset.md

In there they have an example:

...
# dataset settings
dataset_type = 'CocoDataset'
classes = ('a', 'b', 'c', 'd', 'e')
...
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file='path/to/your/train/data',
        ...),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file='path/to/your/val/data',
        ...),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file='path/to/your/test/data',
        ...))
...

chencheng1203 commented 4 years ago

Hello, I have the same problem. I added classes to my config file, but the losses are still printed as zero. Have you solved it? Big thanks.

wangsssky commented 4 years ago

@chencheng1203 I met the same problem as well. Any solutions?

ecm200 commented 4 years ago

Hi @chencheng1203.

My advice would be to double- and triple-check that your dataset format makes sense. I chose to convert my dataset to COCO format so I could lean on all the tools that have been developed for COCO data, including the data loader and the evaluation / validation of the model.

Most of my problems have stemmed from bugs I didn't catch in my input data generation workflow outside of MMDetection. Specifically, I have been generating synthetic data for the problem I am trying to solve, which includes all ancillary data such as bounding boxes, polygons and segmentation masks. As the synthetic data is computationally expensive to produce, I have used augmentation to increase the amount of training data available. This involved rotating the original images, and because the images are rectangular, rotation leaves parts of some polygons and bounding boxes outside the image frame.

The majority of my problems stemmed from not handling these cases in the training data, i.e. not limiting the polygon and bounding box extents to the image frame. In particular, the last one, which eventually solved the issue for me, was bounding boxes that had some portion of their area outside the image frame. I had removed those rotated completely out of the image frame, but not those straddling it. I fixed this by clipping the bounding box extents to the image frame extents, and this seems to have solved the problem.

Other problems related to the training images not being of sufficient quality. As the process for modelling synthetic data is probabilistic, there were cases where extreme images were produced, with almost no objects in the image or a single massive object that obscured 90% of the frame.

So, the main message is: check, check and check again your input data and make sure it is as well conditioned as you can make it, avoiding pitfalls like bounding boxes or polygons outside the image frame.
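
The clipping step described above could look roughly like this. It is a sketch under stated assumptions, not the code actually used for this dataset: boxes are taken to be in COCO `[x, y, width, height]` pixel format, and the helper names `clip_coco_bbox` and `clean_annotations` as well as the `min_size` threshold are invented for illustration.

```python
def clip_coco_bbox(bbox, img_w, img_h, min_size=1.0):
    """Clip a COCO [x, y, w, h] box to the image frame.

    Returns the clipped [x, y, w, h], or None when less than
    min_size pixels remain in either dimension.
    """
    x, y, w, h = bbox
    # Convert to corner coordinates and clamp each corner to the frame.
    x1 = min(max(x, 0.0), img_w)
    y1 = min(max(y, 0.0), img_h)
    x2 = min(max(x + w, 0.0), img_w)
    y2 = min(max(y + h, 0.0), img_h)
    new_w, new_h = x2 - x1, y2 - y1
    if new_w < min_size or new_h < min_size:
        return None
    return [x1, y1, new_w, new_h]


def clean_annotations(annotations, image_sizes, min_size=1.0):
    """Clip every annotation's bbox and drop the ones that vanish.

    image_sizes maps image_id -> (width, height). Only 'bbox' is
    rewritten; 'area' and 'segmentation' are left untouched here.
    """
    cleaned = []
    for ann in annotations:
        img_w, img_h = image_sizes[ann["image_id"]]
        box = clip_coco_bbox(ann["bbox"], img_w, img_h, min_size)
        if box is not None:
            cleaned.append(dict(ann, bbox=box))
    return cleaned
```

A box straddling the left edge, such as `[-10, 5, 30, 20]` in a 100x50 image, is shrunk to the part inside the frame, while a box lying entirely outside is dropped. Polygons and masks would need the same treatment separately.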

ecm200 commented 4 years ago

Oh, the other thing with classes is to make sure the number of classes is changed in the model configuration, in the bbox_head and mask_head of the roi_head.

I have a two class problem at the moment (including the background), and this is my Mask RCNN FPN ResNet101 model config. There are some bespoke additions I have made. These are mainly in the image loading pipeline, where I have made a bespoke image loading function, as my images are single channel and 12-bit. I have also subclassed the CocoDataset class to make my own bespoke Dataset class, which was mainly for understanding and modifying the existing COCO evaluation functions for validation at the end of each training epoch.

I have also built a validation checkpointer hook, to save the model only if the desired validation metric has improved from the previous epoch.

This model is currently training on a 5000-image training set, is on the 10th epoch, and is converging nicely:

_base_ = [
    '../configs/_base_/models/mask_rcnn_r50_fpn.py',
    '../configs/_base_/datasets/coco_instance.py',
    '../configs/_base_/schedules/schedule_1x.py', 
    '../configs/_base_/default_runtime.py'
]
dataset_type = 'MorphologiDataset' # 'CocoDataset'
data_root_dir = '/datadrive/drive0/willow_tree_synthetic/mpdsim/train_test_coco_fmt/'
data_dir = 'spherical_test_data_v1p4_5000_500/' # 'test_dataset_v1p2_100/' #
data_root = data_root_dir+data_dir
classes=['particle']
# Update model due to classes
model = dict(
    pretrained=None, # Don't load the pretrained weights. TODO Find out if this is needed changing number of classes.
    roi_head=dict(
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=len(classes), # Modified to number of classes in this problem. COCO default is 80.
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0., 0., 0., 0.],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)), # This is L1Loss in standard COCO, specified this way in CityScapes
        mask_head=dict(
            type='FCNMaskHead',
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=len(classes), # Modified to number of classes in this problem. COCO default is 80.
            loss_mask=dict(
                type='CrossEntropyLoss', use_mask=True, loss_weight=1.0)))
)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1296, 972), keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    #dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1296, 972),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='Normalize', **img_norm_cfg),
            #dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2, # TODO Usually set to 2, set to 1 for debugging.
    imgs_per_gpu=2, # TODO Usually set to 2, set to 1 for debugging.
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/train_coco.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline))
test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.05,
        nms=dict(type='nms', iou_thr=0.5),
        max_per_img=500, # Default is 100
        mask_thr_binary=0.5)) # Default is 0.5
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
checkpoint_config = dict(type='CheckpointHook', interval=1)
# checkpoint_config = dict(type='ValidationCheckpointHook', metric='acc', metric_ops='max', overwrite_checkpoint=True)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
total_epochs = 36
lr_config = dict(step=[24, 33]) # Based on 8 and 11 for 12 Epochs.
gpus = 1
evaluation = dict(interval=1, metric='bbox') # TODO Remove, temporarily here so that evaluation is done on BBOX only at the moment as SEGM not working.
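
Since the fixes in this thread come down to keeping classes and num_classes in step, a small consistency check can catch a mismatch before training starts. This is only a sketch: it assumes the config has been turned into plain nested dicts laid out like the configs above (`data.train/val/test` and `model.roi_head.bbox_head/mask_head`), and `check_class_settings` is a hypothetical helper, not an MMDetection function.

```python
def check_class_settings(cfg):
    """Return a list of problems found in a config-like nested dict."""
    problems = []

    # Collect the 'classes' entry of each dataset split.
    class_lists = {}
    for split in ("train", "val", "test"):
        classes = cfg.get("data", {}).get(split, {}).get("classes")
        if classes is None:
            problems.append(f"data.{split}: 'classes' is not set")
        else:
            class_lists[split] = tuple(classes)

    if len(set(class_lists.values())) > 1:
        problems.append("data splits use different 'classes' lists")

    # Compare the heads' num_classes with the number of class names.
    expected = len(next(iter(class_lists.values()))) if class_lists else None
    roi_head = cfg.get("model", {}).get("roi_head", {})
    for head in ("bbox_head", "mask_head"):
        head_cfg = roi_head.get(head)
        if head_cfg is None or expected is None:
            continue
        num = head_cfg.get("num_classes")
        if num != expected:
            problems.append(
                f"model.roi_head.{head}.num_classes={num}, expected {expected}"
            )
    return problems
```

An empty list means the class settings agree with each other. It says nothing about whether the names match the annotation file, which has to be checked against the data itself.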

chencheng1203 commented 4 years ago

Well, thank you for telling me so much, bro. It really works: I found the issue in my dataset, there was something wrong with it.

wangsssky commented 4 years ago

@ecm200 Thank you for your config example. You are right, this issue is caused by the classes. I didn't notice that I should add this line here for my custom dataset as a newbie to mmdetection.

train=dict(
        type=dataset_type,
        classes=classes,  # <-- this line
        ann_file='path/to/your/train/data',
        ...),

open-mmlab / mmdetection

loss_rpn_cls=0, loss_bbox=0 #2744