Closed WingRS closed 2 years ago
Hi. I guess, this is about the face detection, but if not, anyway, the back-end part of all the object detection models is the same. In the face detection documentation we have:
If you would like to start training from pre-trained weights use --load-weights parameter instead of --resume-from. Also you can use parameters such as --epochs, --batch-size, --gpu-num, --base-learning-rate, otherwise default values will be loaded from ${MODEL_TEMPLATE}.
The difference between --load-weights and --resume-from is that in the first case only the model weights are loaded, while in the second both the model and the optimizer state are loaded. So, probably, this is what you are looking for.
Speaking about this:
The main question is that when I start training from the best-saved model the training kinda goes randomly (just oscillates) and then goes down.
I guess you load the weights, but the optimizer is initialized from scratch, so it takes some time to collect statistics like running means.
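To illustrate the point about optimizer statistics, here is a minimal pure-Python sketch of SGD with momentum on a toy quadratic loss. It is not the training code from this thread; the loss, the helper names `sgd_momentum_step` and `grad_fn`, and the step count are assumptions chosen only to show that the same weights take a different next step depending on whether the momentum buffer is restored or reset.

```python
def sgd_momentum_step(param, grad, buf, lr=0.05, momentum=0.9):
    """One SGD-with-momentum update; returns the new parameter and buffer."""
    buf = momentum * buf + grad
    return param - lr * buf, buf


def grad_fn(p):
    """Gradient of the toy loss (p - 3)^2 (hypothetical, for illustration)."""
    return 2.0 * (p - 3.0)


# "First training": a few steps that build up a non-zero momentum buffer.
param, buf = 0.0, 0.0
for _ in range(5):
    param, buf = sgd_momentum_step(param, grad_fn(param), buf)

# Resume-style restart: weights and momentum buffer are both restored.
resumed_param, _ = sgd_momentum_step(param, grad_fn(param), buf)

# Weights-only restart: same weights, but the buffer starts from zero.
reloaded_param, _ = sgd_momentum_step(param, grad_fn(param), 0.0)

print(resumed_param, reloaded_param)
```

The two restarts begin from identical weights yet move to different points, which is one plausible reason a weights-only restart can look noisy for the first iterations.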
How do you run finetuning? With --load-weights or with --resume-from?
I've run it with --resume-from and it seems to be better, but it doesn't achieve the same mAP score.
Also I have found out that my server config was wrong: OBJ_DET_DIR was pointing to the wrong directory. But it still was working; can this influence the results?
Could you please attach an updated screenshot of the training log? About
the OBJ_DET_DIR was pointing to wrong directory. But it still was working, can this influence the results?
If you are training and finetuning the model on the same dataset, this should work OK. If you train on one dataset and finetune on another, it is in general not guaranteed that the mAP will be higher.
If I understand correctly, you train the model on one specific dataset, and the training goes well enough. Then you use the same dataset and finetune the same model, but the mAP goes down, and, moreover, after the finetuning the loss value initially is about 2, whereas at the end of the first training it was about 1 (from the screenshots)?
Yeah, you understood it correctly. I've changed the obj det dir to the correct one, ran it again, and it went even worse. The training log:
2021-09-02 10:45:11,103 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0]
CUDA available: True
GPU 0: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.4.r11.4/compiler.30033411_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.9.1+cu111
OpenCV: 4.5.3
MMCV: 1.3.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.4
MMDetection: 2.9.0+
MMDetection Compiler: GCC 9.3
MMDetection CUDA Compiler: 11.4
NNCF: 1.7.0
ONNX: 1.10.1
ONNXRuntime: None
OpenVINO MO: None
OpenVINO IE: None
------------------------------------------------------------
2021-09-02 10:45:11,301 - mmdet - INFO - Distributed training: True
2021-09-02 10:45:11,485 - mmdet - INFO - Config:
input_size = 384
image_width = 384
image_height = 384
width_mult = 1.0
model = dict(
type='SingleStageDetector',
backbone=dict(
type='mobilenetv2_w1',
out_indices=(4, 5),
frozen_stages=-1,
norm_eval=False,
pretrained=True),
neck=None,
bbox_head=dict(
type='SSDHead',
num_classes=9,
in_channels=(96, 320),
anchor_generator=dict(
type='SSDAnchorGeneratorClustered',
strides=(16, 32),
widths=[[
17.665686318905415, 40.73450634200414, 117.61498788545624,
64.34307112517054
],
[
94.72263671830719, 173.1972153968913,
320.2371854303864, 207.336830535971, 352.20547313334856
]],
heights=[[
22.150579702734092, 68.24921767068975, 68.97260088862039,
148.00114686179387
],
[
265.86875666086945, 166.20475919582213,
143.7800147372461, 310.2971364875696,
330.45387885029794
]]),
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=(0.0, 0.0, 0.0, 0.0),
target_stds=(0.1, 0.1, 0.2, 0.2)),
depthwise_heads=True,
depthwise_heads_activations='relu',
loss_balancing=True),
train_cfg=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.4,
neg_iou_thr=0.4,
min_pos_iou=0.0,
ignore_iof_thr=-1,
gt_max_assign_all=False),
smoothl1_beta=1.0,
use_giou=False,
use_focal=False,
allowed_border=-1,
pos_weight=-1,
neg_pos_ratio=3,
debug=False),
test_cfg=dict(
nms=dict(type='nms', iou_threshold=0.45),
min_bbox_size=0,
score_thr=0.02,
max_per_img=200,
nms_pre_classwise=200))
cudnn_benchmark = True
dataset_type = 'CocoDataset'
img_norm_cfg = dict(mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile', to_float32=True),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='PhotoMetricDistortion',
brightness_delta=32,
contrast_range=(0.5, 1.5),
saturation_range=(0.5, 1.5),
hue_delta=18),
dict(
type='MinIoURandomCrop',
min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
min_crop_size=0.1),
dict(type='Resize', img_scale=(384, 384), keep_ratio=False),
dict(type='Normalize', mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(384, 384),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=False),
dict(
type='Normalize',
mean=[0, 0, 0],
std=[255, 255, 255],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=64,
workers_per_gpu=4,
train=dict(
type='RepeatDataset',
times=5,
dataset=dict(
type='CocoDataset',
ann_file='/home/mrs/utils/labels_coco/train.json',
img_prefix='/home/mrs/Documents/system_dataset',
pipeline=[
dict(type='LoadImageFromFile', to_float32=True),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='PhotoMetricDistortion',
brightness_delta=32,
contrast_range=(0.5, 1.5),
saturation_range=(0.5, 1.5),
hue_delta=18),
dict(
type='MinIoURandomCrop',
min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
min_crop_size=0.1),
dict(type='Resize', img_scale=(384, 384), keep_ratio=False),
dict(
type='Normalize',
mean=[0, 0, 0],
std=[255, 255, 255],
to_rgb=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
],
classes=[
'rope', 'helmet', 'cellphone', 'backpack', 'survivor', 'cube',
'vent', 'drill', 'fire_extinguisher'
])),
val=dict(
type='CocoDataset',
ann_file='/home/mrs/utils/labels_coco/val.json',
img_prefix='/home/mrs/Documents/system_dataset',
test_mode=True,
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(384, 384),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=False),
dict(
type='Normalize',
mean=[0, 0, 0],
std=[255, 255, 255],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
classes=[
'rope', 'helmet', 'cellphone', 'backpack', 'survivor', 'cube',
'vent', 'drill', 'fire_extinguisher'
]),
test=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/val2017',
test_mode=True,
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(384, 384),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=False),
dict(
type='Normalize',
mean=[0, 0, 0],
std=[255, 255, 255],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
classes=[
'rope', 'helmet', 'cellphone', 'backpack', 'survivor', 'cube',
'vent', 'drill', 'fire_extinguisher'
]))
optimizer = dict(type='SGD', lr=0.05, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
policy='CosineAnnealing',
min_lr=1e-05,
warmup='linear',
warmup_iters=100,
warmup_ratio=0.1)
checkpoint_config = dict(interval=1)
log_config = dict(
interval=10,
hooks=[dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')])
total_epochs = 1000
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/home/mrs/utils/mobilenet_v2_384/resume_90_correct'
load_from = ''
resume_from = '/home/mrs/utils/models_384/big/epoch_90.pth'
workflow = [('train', 1)]
gpu_ids = range(0, 1)
2021-09-02 10:45:13,594 - mmdet - INFO - load checkpoint from /home/mrs/utils/models_384/big/epoch_90.pth
2021-09-02 10:45:13,594 - mmdet - INFO - Use load_from_local loader
2021-09-02 10:45:13,641 - mmdet - INFO - resumed epoch 90, iter 133110
2021-09-02 10:45:13,641 - mmdet - INFO - Start running, host: mrs@gigabedna-focal, work_dir: /home/mrs/utils/mobilenet_v2_384/resume_90_correct
2021-09-02 10:45:13,641 - mmdet - INFO - workflow: [('train', 1)], max: 1000 epochs
2021-09-02 10:45:27,182 - mmdet - INFO - Epoch [91][1/1479] lr: 4.901e-02, eta: 2105 days, 22:56:47, time: 13.519, data_time: 11.897, memory: 10999, loss_cls: 1.1630, loss_bbox: 0.6857, loss: 1.8487
run it again and it went even worse
From what I see after the first iteration, the loss is at least the same as it was at the end of the main tuning. How do you conclude that it goes worse? Could you please attach the log of the loss and validation mAP after several epochs, or a tensorboard screenshot? Also, a model is usually finetuned with a lower LR. If you use the same learning rate for finetuning, it is not guaranteed that the model's loss will be lower.
I'd assume that the mAP on the start would be somewhat the same as for the last epochs in previous model, am I mistaken?
I'd assume that the mAP on the start would be somewhat the same as for the last epochs in previous model, am I mistaken?
Yes, it should be. As I see from the first screenshot, at the end of the tuning the mAP was about 0.45. What is the mAP at the beginning of the finetuning (with the fixed directory)? I cannot see any mention of the mAP in the log you provided.
Also, I see from the log that you use CUDA 11.1; in general, it is not compatible with our mmdetection, where we use 10.2. I saw one of your previous issues which looks similar to this one. Could you please do the step from the last issue:
run python external/mmdetection/mmdet/utils/collect_env.py and share the output?
Do you use our version of the mmdetection or mmcv one?
Output of the command using the detection env
Traceback (most recent call last):
File "external/mmdetection/mmdet/utils/collect_env.py", line 6, in <module>
import mmdet
ModuleNotFoundError: No module named 'mmdet'
The installation seems to be successful, and the first training was fine, so I assumed it would work with my cuda. The output of the tensorboard with fixed directory
Also output of the command you provided with the venv in source directory:
Traceback (most recent call last):
File "external/mmdetection/mmdet/utils/collect_env.py", line 3, in <module>
from mmcv.utils import collect_env as collect_base_env
ImportError: cannot import name 'collect_env' from 'mmcv.utils' (/home/mrs/utils/training_extensions/venv/lib/python3.8/site-packages/mmcv/utils/__init__.py)
I've just cloned this repository, created a new environment and installed all the packages from scratch using this file. collect_env.py works fine for me, but note that I have CUDA 10.2 installed. You can try to install mmcv-full==1.3.3 instead of mmcv, or downgrade CUDA to 10.2 and reinstall the environment (though I'm not sure your RTX 3090 supports 10.2).
Also output of the command you provided with the venv in source directory:
Did not get it: do you have two environments, or did you run collect_env.py the first time without installing the packages from init_venv.sh?
I do have mmcv 1.3.0, but not mmdet; that's the issue from what I see. Yes, so the first venv is from the README in the root directory of training_extensions, and the second one is in models/object_detection.
mmdet should be installed while running init_venv.sh from models/object_detection.
The script in models/object_detection is self-sufficient.
Could you please share the output of pip list?
Here is the pip list
Package Version Location
------------------------------- ----------- ---------------------------------------
absl-py 0.13.0
actionlib 1.13.2
addict 2.4.0
angles 1.9.13
astroid 2.5.8
attrs 21.2.0
base_local_planner 1.17.1
bondpy 1.8.6
cachetools 4.2.2
camera-calibration 1.15.3
camera-calibration-parsers 1.12.0
catkin 0.8.9
certifi 2021.5.30
charset-normalizer 2.0.4
controller-manager 0.19.4
controller-manager-msgs 0.19.4
cv-bridge 1.15.0
cycler 0.10.0
Cython 0.29.24
defusedxml 0.7.1
diagnostic-analysis 1.10.3
diagnostic-common-diagnostics 1.10.3
diagnostic-updater 1.10.3
dynamic-reconfigure 1.7.1
editdistance 0.5.3
gazebo_plugins 2.9.1
gazebo_ros 2.9.1
gencpp 0.6.5
geneus 3.0.0
genlisp 0.4.18
genmsg 0.5.16
gennodejs 2.0.2
genpy 0.6.14
google-auth 1.34.0
google-auth-oauthlib 0.4.5
graphviz 0.17
grpcio 1.39.0
idna 3.2
image-geometry 1.15.0
iniconfig 1.1.1
interactive-markers 1.12.0
isort 5.9.3
joblib 1.0.1
joint_state_publisher 1.15.0
joint_state_publisher_gui 1.15.0
jsonschema 3.2.0
jstyleson 0.0.2
kiwisolver 1.3.1
laser_geometry 1.6.7
lazy-object-proxy 1.6.0
Markdown 3.3.4
matplotlib 3.4.3
mccabe 0.6.1
message-filters 1.15.9
mmcv-full 1.3.0
mmpycocotools 12.0.3
natsort 7.1.1
networkx 2.6.2
ninja 1.10.2
nncf 1.7.0
numpy 1.19.5
oauthlib 3.1.1
onnx 1.10.1
opencv-python 4.5.3.56
ote 0.2 /home/mrs/utils/training_extensions/ote
packaging 21.0
pandas 1.3.1
Pillow 8.3.1
pip 21.2.4
pkg_resources 0.0.0
pluggy 0.13.1
Polygon3 3.0.8
protobuf 3.17.3
py 1.10.0
py-trees 0.7.6
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydot 1.4.2
pylint 2.7.2
pyparsing 2.4.7
pyrsistent 0.18.0
pytest 6.2.4
python-dateutil 2.8.2
python-qt-binding 0.4.3
pytorchcv 0.0.66
pytz 2021.1
PyYAML 5.4.1
qt-dotgraph 0.4.2
qt-gui 0.4.2
qt-gui-cpp 0.4.2
qt-gui-py-common 0.4.2
requests 2.26.0
requests-oauthlib 1.3.0
resource_retriever 1.12.6
ros_numpy 0.0.4
rosapi 0.11.13
rosbag 1.15.9
rosboost-cfg 1.15.7
rosbridge_library 0.11.13
rosbridge_server 0.11.13
rosclean 1.15.7
roscreate 1.15.7
rosdoc_lite 0.2.10
rosgraph 1.15.9
roslaunch 1.15.9
roslib 1.15.7
roslint 0.12.0
roslz4 1.15.9
rosmake 1.15.7
rosmaster 1.15.9
rosmsg 1.15.9
rosnode 1.15.9
rosparam 1.15.9
rospy 1.15.9
rosservice 1.15.9
rostest 1.15.9
rostopic 1.15.9
rosunit 1.15.7
roswtf 1.15.9
rqt_action 0.4.9
rqt_bag 0.5.1
rqt_bag_plugins 0.5.1
rqt_console 0.4.11
rqt-controller-manager 0.19.4
rqt_dep 0.4.10
rqt-ez-publisher 0.6.1
rqt_graph 0.4.14
rqt_gui 0.5.2
rqt_gui_py 0.5.2
rqt_image_view 0.4.16
rqt_joint_trajectory_controller 0.18.1
rqt_launch 0.4.9
rqt_logger_level 0.4.11
rqt-moveit 0.5.9
rqt_msg 0.4.9
rqt-multiplot 0.0.12
rqt_nav_view 0.5.7
rqt_plot 0.4.13
rqt_pose_view 0.5.10
rqt_publisher 0.4.9
rqt_py_common 0.5.2
rqt_py_console 0.4.9
rqt_py_trees 0.4.0
rqt-reconfigure 0.5.3
rqt-robot-dashboard 0.5.8
rqt-robot-monitor 0.5.13
rqt_robot_steering 0.5.12
rqt_runtime_monitor 0.5.8
rqt-rviz 0.6.1
rqt_service_caller 0.4.9
rqt_shell 0.4.10
rqt_srv 0.4.8
rqt_tf_tree 0.6.2
rqt_top 0.4.9
rqt_topic 0.4.12
rqt_web 0.4.9
rsa 4.7.2
rviz 1.14.5
scikit-learn 0.24.2
scipy 1.7.1
sensor-msgs 1.13.1
setuptools 44.0.0
six 1.16.0
smach 2.5.0
smach-ros 2.5.0
smach_viewer 3.0.1
smclib 1.8.6
subt_comms_test 0.1.0
tensorboard 2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
terminaltables 3.1.0
test-generator 0.1.1
texttable 1.6.4
tf 1.13.2
tf-conversions 1.13.2
tf2-geometry-msgs 0.7.5
tf2-kdl 0.7.5
tf2-py 0.7.5
tf2-ros 0.7.5
tf2-sensor-msgs 0.7.5
threadpoolctl 2.2.0
toml 0.10.2
topic-tools 1.15.9
torch 1.8.1+cu111
torchvision 0.9.1+cu111
tqdm 4.62.0
typing-extensions 3.10.0.0
unique_id 1.0.6
urllib3 1.26.6
Werkzeug 2.0.1
wheel 0.37.0
wrapt 1.12.1
xacro 1.14.6
yapf 0.31.0
Ok, you can try to install mmdetection manually. In models/object_detection run: pip install -e ../../external/mmdetection/ -c constraints.txt
Or you can initialize another environment using the init_venv.sh script, and I insist on this option. This guarantees that a clean environment is created solely for object detection and that there are no package conflicts (explicit or implicit).
But can this really make the evaluation lower, since it's working and giving some results?
Between the first training (where it achieved around 40% mAP) and the last one, the only thing that changed is moving the model from /tmp/my_model to a local directory in /home/.
In the ideal situation everything should work as expected, but there are strange things here: mmdetection is not installed as a package (as we see from the pip list and collect_env.py), yet the training works. In this case I'm just not sure how the training (and, consequently, the finetuning) works.
Ok, could you try to run the evaluation of your trained model, using separate script? Does it reproduce the same mAP values on the same val set?
I'll do it on friday, and in the meantime I am uploading data to cluster which has cuda 10.2, let's see if the problem will be the same there. Thx for the help, I'll write about it tomorrow.
Hi, so I've found another issue. When installing training_extensions on a new machine, part of init_venv.sh fails. In particular, git submodule init ../../external returned an error in my case, and I needed to call it myself from the root.
OK, watching for the further results!
Also, I started the training with a lower LR (took a look at the last epoch in the best training) of 0.002, but the tensorboard shows that the training began with a higher LR. The log:
optimizer = dict(type='SGD', lr=0.0002, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
policy='CosineAnnealing',
min_lr=1e-05,
warmup='linear',
warmup_iters=100,
warmup_ratio=0.1)
checkpoint_config = dict(interval=1)
log_config = dict(
interval=10,
hooks=[dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')])
total_epochs = 1000
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/home/mrs/utils/mobilenet_v2_384/resume_90_correct_lower_lr'
load_from = ''
resume_from = '/home/mrs/utils/epoch_90.pth'
workflow = [('train', 1)]
gpu_ids = range(0, 1)
2021-09-03 07:18:31,018 - mmdet - INFO - load checkpoint from /home/mrs/utils/epoch_90.pth
2021-09-03 07:18:31,018 - mmdet - INFO - Use load_from_local loader
2021-09-03 07:18:31,063 - mmdet - INFO - resumed epoch 90, iter 133110
2021-09-03 07:18:31,063 - mmdet - INFO - Start running, host: mrs@gigabedna-focal, work_dir: /home/mrs/utils/mobilenet_v2_384/resume_90_correct_lower_lr
2021-09-03 07:18:31,063 - mmdet - INFO - workflow: [('train', 1)], max: 1000 epochs
2021-09-03 07:18:44,815 - mmdet - INFO - Epoch [91][1/1479] lr: 4.901e-02, eta: 2138 days, 12:17:58, time: 13.728, data_time: 12.052, memory: 10999, loss_cls: 0.8792, loss_bbox: 0.5224, loss: 1.4016
2021-09-03 07:18:49,317 - mmdet - INFO - Epoch [91][10/1479] lr: 4.901e-02, eta: 242 days, 3:20:20, time: 1.816, data_time: 1.240, memory: 10999, loss_cls: 1.3665, loss_bbox: 0.8127, loss: 2.1792
2021-09-03 07:18:55,506 - mmdet - INFO - Epoch [91][20/1479] lr: 4.901e-02, eta: 125 days, 21:20:29, time: 0.619, data_time: 0.161, memory: 10999, loss_cls: 1.6301, loss_bbox: 2.1397, loss: 3.7698
And the tensorboard screenshot:
The eval.py also ends with an error, but it confirms that the previously trained model has the mAP shown in the tensorboard screenshot. The log:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.430
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.748
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.439
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.471
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.528
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.528
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.528
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.528
OrderedDict([('bbox_mAP', 0.43), ('bbox_mAP_50', 0.748), ('bbox_mAP_75', 0.439), ('bbox_mAP_s', -1.0), ('bbox_mAP_m', -1.0), ('bbox_mAP_l', 0.471), ('bbox_mAP_copypaste', '0.430 0.748 0.439 -1.000 -1.000 0.471')])
Traceback (most recent call last):
File "eval.py", line 46, in <module>
main()
File "eval.py", line 42, in main
evaluator(**eval_args)
File "/home/mrs/utils/training_extensions/ote/ote/modules/evaluators/base.py", line 38, in __call__
self._evaluate_internal(config, snapshot, out, update_config, metrics_functions, **kwargs)
File "/home/mrs/utils/training_extensions/ote/ote/modules/evaluators/base.py", line 75, in _evaluate_internal
with open(out, 'w') as write_file:
PermissionError: [Errno 13] Permission denied: '/metrics.yaml'
Also, a question - does it output mAP score for each class?
Also, when installing on a fresh machine, got this error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
ote 0.2 requires mmcv==1.3.9, which is not installed.
Though the command pip list | grep mm returns:
mmcv-full 1.3.0
mmdet 2.9.0 /home/stasiyur/training_extensions/external/mmdetection
mmlvis 10.5.3
mmpycocotools 12.0.3
And the ote is installed:
ote 0.2 /home/stasiyur/training_extensions/ote
And nothing in the init_venv.sh file was changed. I am trying to investigate the issue so far.
I inited the training with lower LR (took a look at the last epoch in best training) with 0.002, but the tensorboard shows that the training begun with higher LR.
I see, this is strange. I need more time to figure out what causes this. Do you run the training in the same environment (with the non-working export of mmdet)?
The eval.py also ends with an error, but it proves the evaluation of the model that was trained before has mAP as on the screenshot of tensorboard shows.
Ok, this gives us certainty the model is saved and loaded correctly.
Also, a question - does it output mAP score for each class?
If you mean AP (because mAP is mean AP, averaged over all classes), you have to add --options "classwise=True" when using the test script or add classwise=True in the evaluation config for training-time evaluation.
See details here.
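As a sketch of the training-time option mentioned above, assuming the config follows the usual mmdetection layout in which the `evaluation` dict is forwarded to the COCO-style dataset's evaluation call. The `interval` and `metric` values are illustrative and do not come from this thread, and whether the training_extensions wrapper honors this key is an assumption.

```python
# Hypothetical fragment of an mmdetection-style config file (model.py).
# classwise=True asks the COCO evaluation to report AP per class
# in addition to the averaged mAP.
evaluation = dict(
    interval=1,       # evaluate after every epoch (illustrative value)
    metric='bbox',    # bounding-box metrics (illustrative value)
    classwise=True,   # per-class AP table
)
```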
Also, when installing on a fresh machine, got this error:
You can ignore this error or install mmcv==1.3.9 manually
If you mean AP (because mAP is mean AP, averaged over all classes), you have to add --options "classwise=True" when using the test script or add classwise=True in the evaluation config for training-time evaluation.
The options argument doesn't work and is not recognized.
I see, this is strange. Need more time to figure out, what does cause this. Do you run the training in the same environment (with non-working export of mmdet)?
Yes, it is run on that machine
So, I have made a clean installation on the current server (with RTX 3090, since the cluster has a problem with torch compilation), and the collect env log looks like this:
sys.platform: linux
Python: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0]
CUDA available: True
GPU 0: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.4.r11.4/compiler.30033411_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.9.1+cu111
OpenCV: 4.5.1-openvino
MMCV: 1.3.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.4
MMDetection: 2.9.0+4856591
MMDetection Compiler: GCC 9.3
MMDetection CUDA Compiler: 11.4
NNCF: 1.7.0
ONNX: 1.10.1
ONNXRuntime: 1.8.1
OpenVINO MO: 2021.2.0-1877-176bdf51370-releases/2021/2
OpenVINO IE: 2.1.2021.2.0-1877-176bdf51370-releases/2021/2
Everything seems to be working, besides initiating from a pre-learned model. The log seems to state:
2021-09-05 16:51:27,941 - mmdet - INFO - workflow: [('train', 1)], max: 1000 epochs
INFO:mmdet:workflow: [('train', 1)], max: 1000 epochs
/home/mrs/deep_learning/training_extensions/models/object_detection/det_venv/lib/python3.8/site-packages/mmcv/runner/hooks/logger/text.py:55: DeprecationWarning: an integer is required (got type float). Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
mem_mb = torch.tensor([mem / (1024 * 1024)],
2021-09-05 16:51:41,867 - mmdet - INFO - Epoch [91][1/1479] lr: 4.901e-02, eta: 2165 days, 19:10:33, time: 13.903, data_time: 12.249, memory: 10999, loss_cls: 0.8771, loss_bbox: 0.4694, loss: 1.3466
INFO:mmdet:Epoch [91][1/1479] lr: 4.901e-02, eta: 2165 days, 19:10:33, time: 13.903, data_time: 12.249, memory: 10999, loss_cls: 0.8771, loss_bbox: 0.4694, loss: 1.3466
2021-09-05 16:51:41,879 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
INFO:mmcv:Reducer buckets have been rebuilt in this iteration.
I'll leave it to see if the initial experiment will prove to repeat or there is some deep bug.
I might have a clue of what is going on: The learning rate is not going down as it was doing before. This is current experiment, with the last version of the training_extensions and all the packages:
And the best-trained example:
As you can see, the learning rate doesn't decrease as fast as it was last time, the tensorboard shows also the loss is oscillating a lot, which might be the result of the learning rate not decaying.
If you mean AP (because mAP is mean AP, averaged over all classes), you have to add --options "classwise=True" when using the test script or add classwise=True in the evaluation config for training-time evaluation.
The options argument doesn't work and is not recognized.
Sorry, I meant: add classwise=True in the test part of the config.
As you can see, the learning rate doesn't decrease as fast as it was last time
As I saw earlier, you have set the number of epochs to 1000. The LR policy is Cosine Annealing, which means it starts from the initial value and is reduced to min_lr following the cosine law (as the standard cos function acts on the [0, pi] interval). If I interpret this correctly, the LR is calculated using the following rule:
lr = min_lr + 0.5 * (start_lr - min_lr) * [cos(pi * cur_epoch / num_epochs) + 1]
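The cosine rule can be checked against the numbers in the logs above. The sketch below is a plain-Python restatement, lr = min_lr + 0.5 * (base_lr - min_lr) * (cos(pi * epoch / total_epochs) + 1), and is not the scheduler's actual source. The function name `cosine_annealing_lr` is hypothetical; the inputs (base LR 0.05, min_lr 1e-05, resumed at epoch 90 of 1000) are taken from the posted config and log.

```python
import math


def cosine_annealing_lr(base_lr, min_lr, epoch, total_epochs):
    """Cosine-annealed LR for a given (0-based) epoch index."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (math.cos(math.pi * progress) + 1)


# Values from the posted config: lr=0.05, min_lr=1e-05, total_epochs=1000,
# resumed after epoch 90.
lr_resumed = cosine_annealing_lr(0.05, 1e-5, 90, 1000)
print(f"{lr_resumed:.3e}")  # formats as 4.901e-02, as in the resumed-training log

# Same resume point, but with a hypothetical 100-epoch schedule instead.
lr_short = cosine_annealing_lr(0.05, 1e-5, 90, 100)
print(f"{lr_short:.3e}")
```

With 1000 total epochs, epoch 90 is only 9% of the way along the cosine curve, so the LR has barely moved from 0.05. The run whose config said lr=0.0002 logged the same 4.901e-02, which suggests the schedule was still computed from a base LR of 0.05 on resume; that reading is an inference from the numbers rather than something shown directly in the logs.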
The LR policy is Cosine Annealing, this means, it will start from the initial value and will be reduced to the min_lr on the cosine law
Oh, I see, I've just taken a look at the scheduler. Okay, if I start fine-tuning the model and set the number of epochs to 20, for example (resuming from 100 and adding 20 more), what should the starting LR be for fine-tuning?
I've investigated the problem, and here are the results:
--base-learning-rate command line option or in the template.yaml (lr in the model.py is ignored actually).
--load-from or --resume-from).
Thus, in order to finetune the model use --load-weights combined with --base-learning-rate. Adapt the LR policy and warmup in the model.py if needed.
As I can see from the log you provided a couple of messages before, the loss_cls and loss_bbox are near the loss values at the end of the first train (0.3 & 0.8, respectively).
In the meantime, I've tested out training using --resume-from and --base-learning-rate from the previous training, and set 20 epochs. The results seem to be promising:
2. Thus, in order to finetune the model use `--load-weights` combined with `--base-learning-rate`. Adapt the LR policy and warmup in the model.py if needed.
Thanks for your answer, I appreciate your effort! I'll try out this experiment. Though I am having a hard time figuring out what the number of epochs should be. If I set the number too high, the previous issues may arise.
If you use --load-weights, set the exact number of epochs you want to finetune the model. It will be treated as a new experiment, just the model will be loaded from the checkpoint and not initialized from scratch. =)
Okay, so I've started two experiments:
1. --resume-from - up to 200 epochs (since it has proven to train nicely for 20 epochs, I want to try it out)
2. --load-weights for another 100 epochs
So far, I've seen that if I use --load-weights the classification loss goes to 2x what it was in the last epoch and the learning rate jumps, even though I've used the --base-learning-rate argument.
the learning rate jumps
What do you mean by "jumps"? What LR policy and warmup do you use?
What do you mean by "jumps"? What LR policy and warmup do you use?
The built-in one in the templates:
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0.00001,
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.1)
The "jumps" were a mistake in my understanding. However, the metrics go down and then slowly build back up, as in the screenshot. This is using `load-weights` with an additional 100 epochs, 48 epochs into training so far.
![image](https://user-images.githubusercontent.com/13340448/132300554-8be9e2ed-9fd4-4801-9b0a-399ab5f88031.png)
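For reference, a rough sketch of what a linear warmup with `warmup_iters=100` and `warmup_ratio=0.1` does to the LR at the start of a run. It assumes plain linear interpolation from `warmup_ratio * lr` up to the regular LR; the function name is hypothetical and this is not copied from the training code.

```python
def linear_warmup_lr(regular_lr, cur_iter, warmup_iters=100, warmup_ratio=0.1):
    """LR during a linear warmup: starts at warmup_ratio * regular_lr, ends at regular_lr."""
    if cur_iter >= warmup_iters:
        return regular_lr  # warmup is over, the regular schedule applies
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return regular_lr * (1 - k)


# Hypothetical regular LR of 0.01.
for it in (0, 50, 100):
    print(it, linear_warmup_lr(0.01, it))
```

With these settings a fresh run starts at one tenth of the regular LR and reaches the full value after 100 iterations, which may be part of why the loss moves at the beginning of a new experiment.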
Using the step scheduler, which is recommended in the mmdetection documentation for fine-tuning, I've been able to achieve this, but it's only 35 epochs so far. The learning rate is still high in my opinion, though, and there is no way to change it. Maybe you have a suggestion for the learning rate scheduler?
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.0001,
    step=[7])
As I can see from the screenshot, you finetune with `--resume-from`. I can only suggest using `--load-weights` instead, which will allow you to reduce the LR. You can also add LR drops: `step=[7, 14, 21, 28]`. This will drop the LR by 10x at the 7th, 14th, 21st and 28th epochs.
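As a quick illustration of the step policy described above, here is a small sketch. It assumes a decay factor of 0.1 per step (the 10x drop mentioned above) and zero-based epoch counting; the function name is made up for the example.

```python
def step_lr(base_lr, epoch, steps, gamma=0.1):
    """LR after multiplying by gamma once for every step boundary already reached."""
    drops = sum(1 for s in steps if epoch >= s)
    return base_lr * gamma ** drops


steps = [7, 14, 21, 28]
# Hypothetical base LR of 0.01.
for epoch in (0, 7, 14, 21, 28):
    print(epoch, step_lr(0.01, epoch, steps))
```

With four drops the LR ends up four orders of magnitude below the base value, so the base LR and the number of steps should be chosen together.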
Thanks for the advice. I've read about the cyclic learning rate scheduler and ran an experiment with it, and another one with the step scheduler like you suggested. Let's see if it results in some improvement.
I've also spotted an inconsistency between the values in .log and .log.json:
.log
2021-09-08 18:10:01,359 - mmdet - INFO - Epoch [23][70/1682] lr: 1.652e-05, eta: 2 days, 17:49:40, time: 0.782, data_time: 0.305, memory: 10992, loss_cls: 0.8845, loss_bbox: 0.3341, loss: 1.2186
2021-09-08 18:10:07,434 - mmdet - INFO - Epoch [23][80/1682] lr: 1.652e-05, eta: 2 days, 17:49:17, time: 0.607, data_time: 0.142, memory: 10992, loss_cls: 0.8879, loss_bbox: 0.3360, loss: 1.2238
2021-09-08 18:10:15,905 - mmdet - INFO - Epoch [23][90/1682] lr: 1.651e-05, eta: 2 days, 17:49:13, time: 0.847, data_time: 0.394, memory: 10992, loss_cls: 0.8894, loss_bbox: 0.3615, loss: 1.2508
2021-09-08 18:10:22,506 - mmdet - INFO - Epoch [23][100/1682] lr: 1.651e-05, eta: 2 days, 17:48:55, time: 0.660, data_time: 0.191, memory: 10992, loss_cls: 0.9045, loss_bbox: 0.3421, loss: 1.2466
2021-09-08 18:10:30,199 - mmdet - INFO - Epoch [23][110/1682] lr: 1.651e-05, eta: 2 days, 17:48:45, time: 0.769, data_time: 0.312, memory: 10992, loss_cls: 0.8511, loss_bbox: 0.2994, loss: 1.1506
2021-09-08 18:10:36,981 - mmdet - INFO - Epoch [23][120/1682] lr: 1.651e-05, eta: 2 days, 17:48:28, time: 0.678, data_time: 0.214, memory: 10992, loss_cls: 0.8732, loss_bbox: 0.3309, loss: 1.2041
2021-09-08 18:10:45,242 - mmdet - INFO - Epoch [23][130/1682] lr: 1.651e-05, eta: 2 days, 17:48:23, time: 0.826, data_time: 0.337, memory: 10992, loss_cls: 0.8140, loss_bbox: 0.3023, loss: 1.1163
2021-09-08 18:10:51,895 - mmdet - INFO - Epoch [23][140/1682] lr: 1.651e-05, eta: 2 days, 17:48:05, time: 0.665, data_time: 0.194, memory: 10992, loss_cls: 0.8054, loss_bbox: 0.2746, loss: 1.0800
2021-09-08 18:11:01,365 - mmdet - INFO - Epoch [23][150/1682] lr: 1.650e-05, eta: 2 days, 17:48:09, time: 0.947, data_time: 0.476, memory: 10992, loss_cls: 0.8746, loss_bbox: 0.3318, loss: 1.2064
.json
{"mode": "train", "epoch": 23, "iter": 70, "lr": 2e-05, "memory": 10992, "data_time": 0.30503, "loss_cls": 0.88451, "loss_bbox": 0.33411, "loss": 1.21862, "time": 0.7816}
{"mode": "train", "epoch": 23, "iter": 80, "lr": 2e-05, "memory": 10992, "data_time": 0.14151, "loss_cls": 0.88786, "loss_bbox": 0.33597, "loss": 1.22383, "time": 0.60734}
{"mode": "train", "epoch": 23, "iter": 90, "lr": 2e-05, "memory": 10992, "data_time": 0.39376, "loss_cls": 0.88938, "loss_bbox": 0.36147, "loss": 1.25085, "time": 0.84687}
{"mode": "train", "epoch": 23, "iter": 100, "lr": 2e-05, "memory": 10992, "data_time": 0.1913, "loss_cls": 0.90445, "loss_bbox": 0.34213, "loss": 1.24658, "time": 0.66038}
{"mode": "train", "epoch": 23, "iter": 110, "lr": 2e-05, "memory": 10992, "data_time": 0.3117, "loss_cls": 0.85114, "loss_bbox": 0.29945, "loss": 1.15059, "time": 0.76913}
{"mode": "train", "epoch": 23, "iter": 120, "lr": 2e-05, "memory": 10992, "data_time": 0.21424, "loss_cls": 0.87322, "loss_bbox": 0.33092, "loss": 1.20414, "time": 0.67828}
{"mode": "train", "epoch": 23, "iter": 130, "lr": 2e-05, "memory": 10992, "data_time": 0.3365, "loss_cls": 0.81401, "loss_bbox": 0.30232, "loss": 1.11633, "time": 0.82565}
{"mode": "train", "epoch": 23, "iter": 140, "lr": 2e-05, "memory": 10992, "data_time": 0.19364, "loss_cls": 0.80539, "loss_bbox": 0.27459, "loss": 1.07999, "time": 0.66529}
{"mode": "train", "epoch": 23, "iter": 150, "lr": 2e-05, "memory": 10992, "data_time": 0.47623, "loss_cls": 0.8746, "loss_bbox": 0.33185, "loss": 1.20644, "time": 0.94724}
Does the learning rate really differ, or am I misinterpreting the data?
Actually, it does not. The value in the json log (2e-05) is a rounded version of the value in the .log file (1.65e-05). I did not look under the hood in detail, but it seems to be a rounding feature of the logger.
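The rounding is easy to reproduce. The sketch below assumes the json logger rounds floats to five decimal places (which matches the five-digit loss values in the json lines above) and that the text log prints the LR in scientific notation with three decimals; both are assumptions based on the logs in this thread, not on the logger source.

```python
log_lr = 1.652e-05  # value as printed in the .log file

text_repr = f"{log_lr:.3e}"   # scientific notation keeps the significant digits
json_value = round(log_lr, 5)  # rounding to 5 decimal places loses them

print(text_repr)   # 1.652e-05
print(json_value)  # 2e-05
```

So for learning rates this small, the .log file is the one to trust.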
Ok, after long training without any significant improvement, I've decided to test the model with the bigger input, the 512 one. It does not even reach the previous model's performance, and it gives this warning during initialization:
size mismatch for bbox_head.cls_convs.0.3.weight: copying a param with shape torch.Size([324, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([40, 96, 1, 1]).
size mismatch for bbox_head.cls_convs.0.3.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([40]).
size mismatch for bbox_head.cls_convs.1.3.weight: copying a param with shape torch.Size([405, 320, 1, 1]) from checkpoint, the shape in current model is torch.Size([50, 320, 1, 1]).
size mismatch for bbox_head.cls_convs.1.3.bias: copying a param with shape torch.Size([405]) from checkpoint, the shape in current model is torch.Size([50]).
Also, when using the pre-trained weights for the 512x512 input model, it goes into overfitting and the mAP gradually goes to zero. Without the pre-trained weights, it trains smoothly: orange is with the pre-trained weights from GitHub and blue is without. The data in both experiments is the same.
Without the details about the learning rate value, policy and warmup, it is hard to tell the reason for this phenomenon. I can only suggest training from scratch if it suits your task well. ;)
I've already trained without the initial weights. And it works like a charm. The initial setup was the one provided in the repo. The message was more about a possible problem with the initial weights, since they don't match with the architecture, as the PyTorch says.
And a question: does the openvino API perform the non-max suppression by itself in the background? If yes, how is this done during the transition from the PyTorch model to IR format?
The message was more about a possible problem with the initial weights, since they don't match with the architecture, as the PyTorch says.
Again, I cannot say anything concrete without the details about which model and which weights were used. Could you please share the details (config file if standard model was used and which weights do you use, url or something)?
And a question: does the openvino API perform the non-max suppression by itself in the background?
Yes, the NMS is supported.
If yes, how is this done during the transition from the PyTorch model to IR format?
Depends on what you mean. If you mean how to convert the model, then the standard pipeline is used (PyTorch -> ONNX -> OpenVINO IR). The export is performed by this script. If you are interested in the implementation details under the hood, you can refer to the openvino main repo: https://github.com/openvinotoolkit/openvino
If use --resume-from, LR is loaded from the checkpoint and cannot be changed.
- So do we have a way of inferring this LR from the pre-trained `snapshot.pth` before actually starting the training?
- The `external/mmdetection/tools/train.py` just takes a single parameter, `resume-from`. It does not have a `load-weights` at all! In this case, how do you process this? I am trying to understand how mmdetection understands the difference between `load-weights` and `resume-from`.
- With `resume_from`, the LR will be loaded from the snapshot and the whole training process will be continued from the very place it stopped before. If you mean how we can check the value of the LR in the snapshot, we can load it using torch's load and look at the dict.
- `--load-weights` is a parameter of ote. ote is, roughly speaking, a front end of the training process, whereas mmdetection is a back end. This means ote processes parameters and calls mmdetection functions. Under the hood, `resume-from` calls mmcv's `resume` function, and `load-weights` (which is transformed to `load-from` under the hood of ote) just loads the checkpoint without loading the state of the optimizer: https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/base_runner.py#L332
So the whole process is the following: ote processes the model template and passes parameters to mmdetection, which uses mmcv's runner for the training process. It is complicated, but it is the price we pay for modularity and flexibility.
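A minimal sketch of the "load it with torch and look at the dict" idea. It assumes the snapshot is a dict with an `optimizer` entry holding a standard PyTorch optimizer `state_dict()` (which stores the LR per parameter group); the helper names and the file name in the comment are only illustrative.

```python
def lrs_from_checkpoint(checkpoint):
    """Return the LR of every optimizer param group stored in a checkpoint dict."""
    optimizer_state = checkpoint.get("optimizer")
    if optimizer_state is None:
        # Weights-only checkpoint: there is no optimizer state, hence no stored LR.
        return []
    return [group["lr"] for group in optimizer_state["param_groups"]]


def load_lrs(path):
    """Load a snapshot from disk on the CPU and read its stored LRs."""
    import torch  # imported lazily so the helper above works without torch installed

    checkpoint = torch.load(path, map_location="cpu")
    return lrs_from_checkpoint(checkpoint)


# Example with a hand-made dict that mimics the assumed layout:
fake_checkpoint = {
    "meta": {},
    "state_dict": {},
    "optimizer": {"state": {}, "param_groups": [{"lr": 1.65e-05}]},
}
print(lrs_from_checkpoint(fake_checkpoint))
# For a real file it would be something like: load_lrs("snapshot.pth")
```

If the list comes back empty, the file most likely contains only weights, and the LR will be taken from the training parameters rather than from the snapshot.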
Hi! Using the training scripts I was able to train the model and reach around 40% mAP on my dataset. Here is the tensorboard log.
I have around 50k images and around 5k per class (9 classes in total). The main question is that when I start training from the best-saved model, the training kinda goes randomly (just oscillates) and then goes down. Is there a special setting for finetuning in your setup? And here is the tensorboard from starting the training not from snapshot.pth but from my last best model.