Closed whut2962575697 closed 3 years ago
Could you please show your training log? Example: https://github.com/shinya7y/UniverseNet/issues/5#issuecomment-674520670
Thank you!
Python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: Tesla V100-PCIE-32GB
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.1.5
MMDetection: 2.4.0+unknown
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 10.1
------------------------------------------------------------
2021-01-20 14:21:33,609 - mmdet - INFO - Distributed training: False
2021-01-20 14:21:33,960 - mmdet - INFO - Config:
model = dict(
type='GFL',
pretrained=None,
backbone=dict(
type='Res2Net',
depth=50,
scales=4,
base_width=26,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=True,
style='pytorch',
dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
stage_with_dcn=(False, False, False, True)),
neck=[
dict(
type='FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
start_level=1,
add_extra_convs='on_output',
num_outs=5),
dict(
type='SEPC',
out_channels=256,
stacked_convs=4,
pconv_deform=False,
lcconv_deform=True,
ibn=True,
pnorm_eval=True,
lcnorm_eval=True,
lcconv_padding=1)
],
bbox_head=dict(
type='GFLSEPCHead',
num_classes=6,
in_channels=256,
stacked_convs=0,
feat_channels=256,
anchor_generator=dict(
type='AnchorGenerator',
ratios=[1.0],
octave_base_scale=8,
scales_per_octave=1,
strides=[8, 16, 32, 64, 128]),
loss_cls=dict(
type='QualityFocalLoss',
use_sigmoid=True,
beta=2.0,
loss_weight=1.0),
loss_dfl=dict(type='DistributionFocalLoss', loss_weight=0.25),
reg_max=16,
loss_bbox=dict(type='GIoULoss', loss_weight=2.0),
reg_decoded_bbox=True))
train_cfg = dict(
assigner=dict(type='ATSSAssigner', topk=9),
allowed_border=-1,
pos_weight=-1,
debug=False)
test_cfg = dict(
nms_pre=1000,
min_bbox_size=0,
score_thr=0.05,
nms=dict(type='nms', iou_threshold=0.6),
max_per_img=100)
optimizer = dict(type='SGD', lr=0.000125, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=2000,
warmup_ratio=0.001,
step=[8, 11])
total_epochs = 12
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/cache/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth'
resume_from = None
workflow = [('train', 1)]
dataset_type = 'TCDataset'
data_root = '/cache/tc_dataset/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
albu_train_transforms = []
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='Resize',
img_scale=[(6000, 3600), (6000, 4000)],
multiscale_mode='range',
keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=[(6000, 3600), (6000, 3800), (6000, 4000)],
flip=True,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
imgs_per_gpu=1,
workers_per_gpu=1,
train=dict(
type='TCDataset',
ann_file='/cache/tc_dataset/annotations/instances_train2017.json',
img_prefix='/cache/tc_dataset/train2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='Resize',
img_scale=[(6000, 3600), (6000, 4000)],
multiscale_mode='range',
keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]),
val=dict(
type='TCDataset',
ann_file='/cache/tc_dataset/annotations/instances_val2017.json',
img_prefix='/cache/tc_dataset/val2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=[(6000, 3600), (6000, 3800), (6000, 4000)],
flip=True,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='TCDataset',
ann_file='/cache/testA.json',
img_prefix='/cache/testA_imgs/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=[(6000, 3600), (6000, 3800), (6000, 4000)],
flip=True,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
evaluation = dict(interval=1, metric='bbox')
fp16 = dict(loss_scale=512.0)
work_dir = './work_dirs/universenet50_2008_1x'
gpu_ids = range(0, 1)
loading annotations into memory...
Done (t=0.12s)
creating index...
index created!
2021-01-20 14:21:34,920 - mmdet - WARNING - "imgs_per_gpu" is deprecated in MMDet V2.0. Please use "samples_per_gpu" instead
2021-01-20 14:21:34,921 - mmdet - WARNING - Automatically set "samples_per_gpu"="imgs_per_gpu"=1 in this experiments
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
2021-01-20 14:21:38,261 - mmdet - INFO - load checkpoint from /cache/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth
2021-01-20 14:21:38,388 - mmdet - WARNING - The model and loaded state dict do not match exactly
size mismatch for bbox_head.gfl_cls.weight: copying a param with shape torch.Size([80, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([6, 256, 3, 3]).
size mismatch for bbox_head.gfl_cls.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([6]).
2021-01-20 14:21:38,390 - mmdet - INFO - Start running, host: work@job9391f5af-job-universenet2021-5303-0, work_dir: /cache/user-job-dir/codes/UniverseNet/work_dirs/universenet50_2008_1x
2021-01-20 14:21:38,390 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
[W TensorIterator.cpp:924] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
2021-01-20 14:22:54,996 - mmdet - INFO - Epoch [1][50/4310] lr: 3.184e-06, eta: 21:58:05, time: 1.531, data_time: 0.207, memory: 19023, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:24:11,893 - mmdet - INFO - Epoch [1][100/4310] lr: 6.306e-06, eta: 21:59:58, time: 1.538, data_time: 0.224, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:25:25,627 - mmdet - INFO - Epoch [1][150/4310] lr: 9.428e-06, eta: 21:41:37, time: 1.475, data_time: 0.184, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:26:39,751 - mmdet - INFO - Epoch [1][200/4310] lr: 1.255e-05, eta: 21:33:30, time: 1.482, data_time: 0.226, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:27:54,799 - mmdet - INFO - Epoch [1][250/4310] lr: 1.567e-05, eta: 21:31:18, time: 1.501, data_time: 0.213, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:29:06,125 - mmdet - INFO - Epoch [1][300/4310] lr: 1.879e-05, eta: 21:18:48, time: 1.427, data_time: 0.182, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
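The `lr` values in the log lines above can be checked against the config (`lr=0.000125`, `warmup_iters=2000`, `warmup_ratio=0.001`). The sketch below assumes the usual linear warmup formula and a zero-based iteration counter (so the line logged at iteration 50 corresponds to `cur_iter=49`). The function name `linear_warmup_lr` is made up for illustration and is not an MMCV API. Under these assumptions the logged values are reproduced, which suggests the learning rate is still tiny when the first NaN appears.

```python
def linear_warmup_lr(base_lr, cur_iter, warmup_iters, warmup_ratio):
    """Learning rate during linear warmup; cur_iter is zero-based (assumption)."""
    if cur_iter >= warmup_iters:
        return base_lr
    # k shrinks linearly from (1 - warmup_ratio) to 0 over the warmup period.
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)


# Values taken from the logged config.
for logged_iter in (50, 100, 150):
    lr = linear_warmup_lr(0.000125, logged_iter - 1, 2000, 0.001)
    print(f"iter {logged_iter}: lr = {lr:.3e}")
# iter 50: lr = 3.184e-06
# iter 100: lr = 6.306e-06
# iter 150: lr = 9.428e-06
```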
PyTorch compiling details: PyTorch built with:
- CUDA Runtime 10.2
MMDetection CUDA Compiler: 10.1
Please use the same CUDA version, though it may be irrelevant.
Do simpler networks (e.g., RetinaNet, ATSS, GFL) work? Do popular datasets (e.g., COCO) work?
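The CUDA mismatch quoted above (PyTorch built with CUDA Runtime 10.2, MMDetection ops compiled with 10.1) can be spotted automatically from the environment dump. This is a minimal sketch that only parses the text. The helper name `cuda_versions` is hypothetical, and the two patterns assume the dump uses the same wording as the log in this issue.

```python
import re


def cuda_versions(env_text):
    """Return (PyTorch CUDA runtime, MMDetection CUDA compiler) found in env_text."""
    runtime = re.search(r"CUDA Runtime (\d+\.\d+)", env_text)
    compiler = re.search(r"MMDetection CUDA Compiler: (\d+\.\d+)", env_text)
    return (runtime.group(1) if runtime else None,
            compiler.group(1) if compiler else None)


# The two relevant lines from the environment dump in this issue.
env_text = """\
- CUDA Runtime 10.2
MMDetection CUDA Compiler: 10.1
"""
runtime, compiler = cuda_versions(env_text)
if runtime != compiler:
    print(f"CUDA mismatch: PyTorch runtime {runtime}, MMDetection compiler {compiler}")
```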
I trained on this dataset with Cascade R-CNN and got a good result:
Average Precision (AP) @[ IoU=0.10:0.50 | area= all | maxDets=100 ] = 0.675
Average Precision (AP) @[ IoU=0.10 | area= all | maxDets=100 ] = 0.708
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.617
Average Precision (AP) @[ IoU=0.10:0.50 | area= small | maxDets=100 ] = 0.603
Average Precision (AP) @[ IoU=0.10:0.50 | area=medium | maxDets=100 ] = 0.793
Average Precision (AP) @[ IoU=0.10:0.50 | area= large | maxDets=100 ] = 0.959
Average Recall (AR) @[ IoU=0.10:0.50 | area= all | maxDets= 1 ] = 0.604
Average Recall (AR) @[ IoU=0.10:0.50 | area= all | maxDets= 10 ] = 0.898
Average Recall (AR) @[ IoU=0.10:0.50 | area= all | maxDets=100 ] = 0.947
Average Recall (AR) @[ IoU=0.10:0.50 | area= small | maxDets=100 ] = 0.925
Average Recall (AR) @[ IoU=0.10:0.50 | area=medium | maxDets=100 ] = 0.965
Average Recall (AR) @[ IoU=0.10:0.50 | area= large | maxDets=100 ] = 0.998
Does training on COCO with the original finetuning_example.py work? For this issue, 500 iterations should be enough to check whether the loss becomes NaN.
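For a short run like the one suggested here, the text log can be scanned for the first NaN instead of reading it by eye. The sketch below assumes log lines in the `TextLoggerHook` format shown in this issue. The name `first_nan` and the regular expression are illustrative, not part of MMDetection.

```python
import re

# Matches e.g. "Epoch [1][50/4310] ... loss_dfl: nan, loss: nan" and captures
# the epoch, the iteration and the total loss at the end of the line.
LINE_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/\d+\].*?\bloss: (\S+)\s*$")


def first_nan(log_lines):
    """Return (epoch, iteration) of the first logged NaN total loss, or None."""
    for line in log_lines:
        match = LINE_RE.search(line)
        if match and match.group(3).lower() == "nan":
            return int(match.group(1)), int(match.group(2))
    return None


# Two lines from the log in this issue, shortened to the relevant fields.
sample = [
    "2021-01-20 14:22:54,996 - mmdet - INFO - Epoch [1][50/4310] lr: 3.184e-06, "
    "loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan",
    "2021-01-20 14:24:11,893 - mmdet - INFO - Epoch [1][100/4310] lr: 6.306e-06, "
    "loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan",
]
print(first_nan(sample))  # (1, 50)
```

On a real run, the lines can be read from the `.log` file that MMDetection writes into `work_dir`.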
Thank you for your great work! But the loss is always NaN when I train on my own dataset. Can you help me?
This is my config:
# This config shows an example for small-batch fine-tuning from a COCO model.
# Please see also the MMDetection tutorial below.
# https://github.com/shinya7y/UniverseNet/blob/master/docs/tutorials/finetune.md

_base_ = [
    '../_base_/models/universenet50_2008.py',
    # Please change to your dataset config.
    # '../_base_/datasets/coco_detection_mstrain_480_960.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]
model = dict(
    pretrained=None,
    # SyncBN is used in universenet50_2008.py
    # If total batch size < 16, please change BN settings of backbone.
    backbone=dict(
        norm_cfg=dict(type='BN', requires_grad=True), norm_eval=True),
    # iBN of SEPC is used in universenet50_2008.py
    # If samples_per_gpu < 4, please change BN settings of SEPC.
    neck=[
        dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            start_level=1,
            add_extra_convs='on_output',
            # add_extra_convs=True,
            # extra_convs_on_inputs=False,
            num_outs=5),
        dict(
            type='SEPC',
            out_channels=256,
            stacked_convs=4,
            pconv_deform=False,
            lcconv_deform=True,
            ibn=True,
            pnorm_eval=True,  # please set True if samples_per_gpu < 4
            lcnorm_eval=True,  # please set True if samples_per_gpu < 4
            lcconv_padding=1)
    ],
    bbox_head=dict(num_classes=6))  # please change for your dataset

dataset_type = 'MyDataset'
data_root = '/cache/my_dataset/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Resize',
        img_scale=[(4000, 2000), (4000, 2400)],
        multiscale_mode='range',
        keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        # img_scale=[(4000, 2000), (4000, 2200), (4000, 2400)],
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    imgs_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='bbox')

# Optimal total batch size depends on dataset size and learning rate.
# If image sizes are not so large and you have enough GPU memory,
# larger samples_per_gpu will be preferable.
# data = dict(samples_per_gpu=2)

# This config assumes that total batch size is 8 (4 GPUs * 2 samples_per_gpu).
# Since the batch size is half of other configs,
# the learning rate is also halved according to the Linear Scaling Rule.
# Tuning learning rate around it will be important on other datasets.
# For example, you can try 0.005 first, then 0.002, 0.01, 0.001, and 0.02.
optimizer = dict(type='SGD', lr=1.25e-3, momentum=0.9, weight_decay=0.0001)

# optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))
# If fine-tuning from COCO, gradients should not be so large.
# It is natural to train models without gradient clipping.
optimizer_config = dict(_delete_=True, grad_clip=None)

# If fine-tuning from COCO, a warmup_iters of 500 or less may be enough.
# This setting is not so important unless losses are unstable during warmup.
lr_config = dict(warmup_iters=500)

fp16 = dict(loss_scale=512.)

# Please set `load_from` to use a COCO pre-trained model.
load_from = '/cache/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth'  # noqa
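The config comments refer to the Linear Scaling Rule, so here is a small worked example for the single-GPU, one-image-per-GPU setup in this issue. It assumes a reference learning rate of 0.005 at total batch size 8, which is how I read the comment "you can try 0.005 first". The helper `scaled_lr` is hypothetical, not an MMDetection function.

```python
def scaled_lr(ref_lr, ref_batch_size, gpus, samples_per_gpu):
    """Scale the learning rate in proportion to the total batch size."""
    total_batch_size = gpus * samples_per_gpu
    return ref_lr * total_batch_size / ref_batch_size


# Assumed reference point: lr=0.005 at total batch size 8.
lr = scaled_lr(0.005, 8, gpus=1, samples_per_gpu=1)
print(f"{lr:.6f}")  # 0.000625
```

The rule only gives a starting point. The training log in this issue used `lr=0.000125`, which is lower than this estimate, so a too-large base learning rate does not look like the obvious cause of the NaN.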
I have the same issue. Did you fix it?
Sorry, it has not been fixed yet.
I am closing this inactive issue, which lacks enough information to reproduce the NaN. If it is caused by empty gt (images without ground-truth boxes), please use the latest code. I have fixed ATSSHead and GFLHead in this repository and in the mmdet repository in the same way.
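To check whether a dataset is affected by the empty-gt case mentioned above, the COCO-style annotation file can be scanned for images without any boxes. The sketch below also reports boxes with non-positive width or height. That second check is my own addition and was not named in this thread as a cause. The function name and the toy data are hypothetical. COCO boxes are `[x, y, width, height]`.

```python
from collections import Counter


def find_annotation_problems(coco):
    """Return (ids of images without boxes, ids of annotations with degenerate boxes)."""
    boxes_per_image = Counter(ann["image_id"] for ann in coco["annotations"])
    empty_images = [img["id"] for img in coco["images"]
                    if boxes_per_image[img["id"]] == 0]
    bad_boxes = [ann["id"] for ann in coco["annotations"]
                 if ann["bbox"][2] <= 0 or ann["bbox"][3] <= 0]
    return empty_images, bad_boxes


# Toy data for illustration; for a real check, load the annotation file
# with json.load() and pass the resulting dict.
toy = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [0, 0, 10, 10]},
        {"id": 11, "image_id": 2, "bbox": [5, 5, 0, 4]},  # zero width
    ],
}
print(find_annotation_problems(toy))  # ([3], [11])
```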