open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

How to get validation loss/mAP during VarifocalNet training? #10164

Open FraCamp opened 1 year ago

FraCamp commented 1 year ago

I am training VFNet with a custom dataset. This is the config:

```python
# The new config inherits a base config to highlight the necessary modification
_base_ = '../configs/vfnet/vfnet_x101-64x4d-mdconv-c3-c5_fpn_ms-2x_coco.py'

# We also need to change the num_classes in head to match the dataset's annotation
"""
model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=1), mask_head=dict(num_classes=1)))
"""
model = dict(
    bbox_head=dict(num_classes=5)
)

# Modify dataset related settings
data_root = '../data/'
metainfo = {
    'classes': ('swimmer', 'boat', 'jetski', 'life_saving_appliances', 'buoy')
}
train_dataloader = dict(
    batch_size=1,
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        ann_file='annotations/instances_train.json',
        data_prefix=dict(img='images/train/')))
val_dataloader = dict(
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        ann_file='annotations/instances_val.json',
        data_prefix=dict(img='images/val/')))
test_dataloader = val_dataloader

custom_hooks = [
    dict(type='CheckInvalidLossHook', interval=50),
]

visualizer = dict(
    type='DetLocalVisualizer',
    vis_backends=[dict(type='TensorboardVisBackend')],
    name='visualizer')

# Modify metric related settings
val_evaluator = [
    dict(
        type='CocoMetric',
        metric=['bbox', 'segm'],
        ann_file=data_root + 'annotations/instances_val.json',
    ),
    # dict(ann_file=data_root + 'annotations/instances_val.json')
]

test_evaluator = val_evaluator

# We can use the pre-trained VarifocalNet model to obtain higher performance
load_from = 'starting_point/vfnet_x101_64x4d_fpn_mdconv_c3-c5_mstrain_2x_coco_20201027pth-b5f6da5e.pth'
```
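With a custom dataset, one thing worth ruling out is a mismatch between the class names in `metainfo` and the category names stored in the COCO annotation files. Whether that is the cause here cannot be told from the thread. The following is a minimal sketch, not part of MMDetection: `check_categories` is a hypothetical helper, and it assumes the annotation file follows the standard COCO layout with a top-level `categories` list of `{"id", "name"}` entries.

```python
import json


def check_categories(ann_path, classes):
    """Compare COCO category names with the class names from the config.

    Hypothetical helper for illustration; assumes a standard COCO JSON
    file with a top-level "categories" list.
    """
    with open(ann_path, encoding="utf-8") as f:
        coco = json.load(f)
    ann_names = {cat["name"] for cat in coco.get("categories", [])}
    wanted = set(classes)
    return {
        # Classes listed in the config but absent from the annotation file.
        "missing_in_annotations": sorted(wanted - ann_names),
        # Categories present in the file but not listed in the config.
        "unused_in_config": sorted(ann_names - wanted),
    }
```

For the config above, this could be called as `check_categories('../data/annotations/instances_val.json', metainfo['classes'])`; two empty lists mean the names agree.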

This is the error during training:

04/16 17:57:34 - mmengine - INFO - Epoch(train) [1][8600/8930] lr: 1.0000e-02 eta: 2 days, 1:48:17 time: 0.8205 data_time: 0.0114 memory: 5493 loss: 0.3249 loss_cls: 0.1500 loss_bbox: 0.0743 loss_bbox_rf: 0.1006
04/16 17:58:16 - mmengine - INFO - Epoch(train) [1][8650/8930] lr: 1.0000e-02 eta: 2 days, 1:47:04 time: 0.8467 data_time: 0.0130 memory: 5493 loss: 0.4560 loss_cls: 0.2140 loss_bbox: 0.1033 loss_bbox_rf: 0.1387
04/16 17:59:00 - mmengine - INFO - Epoch(train) [1][8700/8930] lr: 1.0000e-02 eta: 2 days, 1:46:18 time: 0.8693 data_time: 0.0136 memory: 4851 loss: 0.2184 loss_cls: 0.0783 loss_bbox: 0.0606 loss_bbox_rf: 0.0795
04/16 17:59:41 - mmengine - INFO - Epoch(train) [1][8750/8930] lr: 1.0000e-02 eta: 2 days, 1:44:50 time: 0.8335 data_time: 0.0133 memory: 4828 loss: 0.1380 loss_cls: 0.0417 loss_bbox: 0.0403 loss_bbox_rf: 0.0560
04/16 18:00:24 - mmengine - INFO - Epoch(train) [1][8800/8930] lr: 1.0000e-02 eta: 2 days, 1:43:36 time: 0.8451 data_time: 0.0128 memory: 5493 loss: 0.1609 loss_cls: 0.0089 loss_bbox: 0.0652 loss_bbox_rf: 0.0868
04/16 18:01:08 - mmengine - INFO - Epoch(train) [1][8850/8930] lr: 1.0000e-02 eta: 2 days, 1:43:12 time: 0.8880 data_time: 0.0145 memory: 5590 loss: 0.3535 loss_cls: 0.1631 loss_bbox: 0.0820 loss_bbox_rf: 0.1084
04/16 18:01:53 - mmengine - INFO - Epoch(train) [1][8900/8930] lr: 1.0000e-02 eta: 2 days, 1:43:04 time: 0.9017 data_time: 0.0142 memory: 4828 loss: 0.0083 loss_cls: 0.0065 loss_bbox: 0.0008 loss_bbox_rf: 0.0010
04/16 18:02:18 - mmengine - INFO - Exp name: conf_20230416_155210
04/16 18:02:18 - mmengine - INFO - Saving checkpoint at 1 epochs
04/16 18:02:40 - mmengine - INFO - Epoch(val) [1][ 50/1547] eta: 0:08:36 time: 0.3448 data_time: 0.0281 memory: 4828
04/16 18:02:56 - mmengine - INFO - Epoch(val) [1][ 100/1547] eta: 0:07:56 time: 0.3132 data_time: 0.0066 memory: 1289
04/16 18:03:11 - mmengine - INFO - Epoch(val) [1][ 150/1547] eta: 0:07:26 time: 0.3002 data_time: 0.0064 memory: 1289
04/16 18:03:25 - mmengine - INFO - Epoch(val) [1][ 200/1547] eta: 0:07:00 time: 0.2894 data_time: 0.0057 memory: 1289
04/16 18:03:40 - mmengine - INFO - Epoch(val) [1][ 250/1547] eta: 0:06:38 time: 0.2901 data_time: 0.0064 memory: 1289
04/16 18:03:54 - mmengine - INFO - Epoch(val) [1][ 300/1547] eta: 0:06:18 time: 0.2840 data_time: 0.0061 memory: 1289
04/16 18:04:07 - mmengine - INFO - Epoch(val) [1][ 350/1547] eta: 0:05:57 time: 0.2681 data_time: 0.0046 memory: 1289
04/16 18:04:23 - mmengine - INFO - Epoch(val) [1][ 400/1547] eta: 0:05:44 time: 0.3127 data_time: 0.0066 memory: 1289
04/16 18:04:39 - mmengine - INFO - Epoch(val) [1][ 450/1547] eta: 0:05:31 time: 0.3208 data_time: 0.0065 memory: 1289
04/16 18:04:54 - mmengine - INFO - Epoch(val) [1][ 500/1547] eta: 0:05:15 time: 0.2940 data_time: 0.0057 memory: 1289
04/16 18:05:09 - mmengine - INFO - Epoch(val) [1][ 550/1547] eta: 0:05:01 time: 0.3047 data_time: 0.0057 memory: 1289
04/16 18:05:24 - mmengine - INFO - Epoch(val) [1][ 600/1547] eta: 0:04:46 time: 0.3105 data_time: 0.0062 memory: 1289
04/16 18:05:40 - mmengine - INFO - Epoch(val) [1][ 650/1547] eta: 0:04:31 time: 0.3043 data_time: 0.0060 memory: 1289
04/16 18:05:55 - mmengine - INFO - Epoch(val) [1][ 700/1547] eta: 0:04:17 time: 0.3138 data_time: 0.0064 memory: 1289
04/16 18:06:09 - mmengine - INFO - Epoch(val) [1][ 750/1547] eta: 0:04:00 time: 0.2792 data_time: 0.0052 memory: 1289
04/16 18:06:25 - mmengine - INFO - Epoch(val) [1][ 800/1547] eta: 0:03:46 time: 0.3146 data_time: 0.0062 memory: 1289
04/16 18:06:41 - mmengine - INFO - Epoch(val) [1][ 850/1547] eta: 0:03:31 time: 0.3220 data_time: 0.0067 memory: 1289
04/16 18:06:57 - mmengine - INFO - Epoch(val) [1][ 900/1547] eta: 0:03:16 time: 0.3139 data_time: 0.0065 memory: 1289
04/16 18:07:12 - mmengine - INFO - Epoch(val) [1][ 950/1547] eta: 0:03:01 time: 0.3022 data_time: 0.0061 memory: 1289
04/16 18:07:25 - mmengine - INFO - Epoch(val) [1][1000/1547] eta: 0:02:45 time: 0.2570 data_time: 0.0048 memory: 1289
04/16 18:07:38 - mmengine - INFO - Epoch(val) [1][1050/1547] eta: 0:02:29 time: 0.2594 data_time: 0.0049 memory: 1289
04/16 18:07:52 - mmengine - INFO - Epoch(val) [1][1100/1547] eta: 0:02:13 time: 0.2838 data_time: 0.0055 memory: 1289
04/16 18:08:05 - mmengine - INFO - Epoch(val) [1][1150/1547] eta: 0:01:58 time: 0.2616 data_time: 0.0047 memory: 1289
04/16 18:08:19 - mmengine - INFO - Epoch(val) [1][1200/1547] eta: 0:01:42 time: 0.2681 data_time: 0.0048 memory: 1289
04/16 18:08:33 - mmengine - INFO - Epoch(val) [1][1250/1547] eta: 0:01:27 time: 0.2857 data_time: 0.0053 memory: 1289
04/16 18:08:47 - mmengine - INFO - Epoch(val) [1][1300/1547] eta: 0:01:12 time: 0.2835 data_time: 0.0053 memory: 1289
04/16 18:09:03 - mmengine - INFO - Epoch(val) [1][1350/1547] eta: 0:00:58 time: 0.3247 data_time: 0.0296 memory: 1260
04/16 18:09:19 - mmengine - INFO - Epoch(val) [1][1400/1547] eta: 0:00:43 time: 0.3115 data_time: 0.0139 memory: 1260
04/16 18:09:34 - mmengine - INFO - Epoch(val) [1][1450/1547] eta: 0:00:28 time: 0.2994 data_time: 0.0057 memory: 1260
04/16 18:09:50 - mmengine - INFO - Epoch(val) [1][1500/1547] eta: 0:00:13 time: 0.3175 data_time: 0.0238 memory: 1260
04/16 18:10:04 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
04/16 18:10:04 - mmengine - ERROR - /workspace/mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 461 - The testing results of the whole dataset is empty.
04/16 18:10:05 - mmengine - INFO - Epoch(val) [1][1547/1547] data_time: 0.0232 time: 0.3143
04/16 18:10:05 - mmengine - WARNING - Since `metrics` is an empty dict, the behavior to save the best checkpoint will be skipped in this evaluation.
04/16 18:10:47 - mmengine - INFO - Epoch(train) [2][ 50/8930] lr: 1.0000e-02 eta: 2 days, 1:40:56 time: 0.8508 data_time: 0.0175 memory: 5493 loss: 0.4520 loss_cls: 0.1341 loss_bbox: 0.1361 loss_bbox_rf: 0.1818
04/16 18:11:05 - mmengine - INFO - Exp name: conf_20230416_155210
04/16 18:11:30 - mmengine - INFO - Epoch(train) [2][ 100/8930] lr: 1.0000e-02 eta: 2 days, 1:40:02 time: 0.8622 data_time: 0.0126 memory: 5493 loss: 0.3708 loss_cls: 0.1107 loss_bbox: 0.1092 loss_bbox_rf: 0.1509
04/16 18:12:15 - mmengine - INFO - Epoch(train) [2][ 150/8930] lr: 1.0000e-02 eta: 2 days, 1:39:55 time: 0.9029 data_time: 0.0142 memory: 5493 loss: 0.2171 loss_cls: 0.0661 loss_bbox: 0.0648 loss_bbox_rf: 0.0862

As you can see, training goes well, but the bbox evaluation step fails, as if it cannot find something it needs. This is the directory structure, which looks fine to me: [Screenshot 2023-04-16 alle 18 35 09]
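In the log above, only the `Epoch(train)` lines carry loss values; the `Epoch(val)` lines report timing and memory only. The training losses can be pulled out of the text log with a small parser. This is a sketch under assumptions: `parse_train_line` is a hypothetical helper, and its regular expressions only target the `Epoch(train) [epoch][iter/total] ... key: value` layout visible in this particular log, which may differ in other MMEngine versions.

```python
import re

# Matches e.g. "Epoch(train) [1][8900/8930]" or "Epoch(train) [2][ 50/8930]".
HEADER = re.compile(r"Epoch\(train\)\s+\[(\d+)\]\[\s*(\d+)/(\d+)\]")
# Matches "loss: 0.0083", "loss_cls: 0.0065", "loss_bbox_rf: 0.0010", ...
LOSS = re.compile(r"\b(loss\w*): ([0-9.eE+-]+)")


def parse_train_line(line):
    """Return epoch, iteration and loss values from one training log line.

    Returns None for lines that are not training iterations (for example
    validation progress or checkpoint messages).
    """
    match = HEADER.search(line)
    if match is None:
        return None
    record = {"epoch": int(match.group(1)), "iter": int(match.group(2))}
    for key, value in LOSS.findall(line):
        record[key] = float(value)
    return record
```

Applying `parse_train_line` to every line of the log file and discarding the `None` results gives a list of records that can be plotted or averaged per epoch.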

FraCamp commented 1 year ago

I found that in coco_metric.py, inside the results variable, every entry has an empty labels array. Can someone tell me which file produces these results? I believe it is a runner of some kind, but it is hard to tell where to look. Alternatively, how should the labels be filled in? I believe there should be a predict call somewhere, but I can't find it.
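The observation above can be quantified by counting how many per-image entries carry no predictions at all. The sketch below is only an illustration: `count_empty_predictions` is a hypothetical helper, and it assumes each entry in `results` is a dict with a `labels` sequence, which is how the variable is described here. The actual structure inside `coco_metric.py` may differ between versions, in which case only the prediction part of each entry should be passed in.

```python
def count_empty_predictions(results):
    """Return (empty, total) for a list of per-image prediction dicts.

    Assumption: each entry is a dict whose "labels" value supports len()
    (a list, tuple, NumPy array or tensor). Entries without a "labels"
    key are counted as empty.
    """
    total = len(results)
    empty = sum(1 for entry in results if len(entry.get("labels", ())) == 0)
    return empty, total
```

If `empty` equals `total`, the model produced no detections on any validation image, which would be consistent with the "testing results of the whole dataset is empty" message in the log. In that case the open question is why the model predicts nothing, not where the labels are assembled.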

XCZhou520 commented 3 months ago

I met the same problem. Have you found how to solve it?

Goddaman commented 2 months ago

I met the same problem. Have you found how to solve it?