Finetuning a trained model

WingRS commented 3 years ago

Hi! Using the training scripts I was able to train the model and reach around 40% mAP on my dataset. Here is the TensorBoard log (screenshot Selection_096).

I have around 50k images and around 5k per class (9 classes in total). The main question is that when I start training from the best-saved model the training kinda goes randomly (just oscillates) and then goes down. Is there a special setting for finetuning in your setup? And here is the TensorBoard output from starting training not from snapshot.pth but from my last best model (screenshot Selection_097).

morkovka1337 commented 3 years ago

Hi. I guess this is about face detection, but even if it is not, the back-end part of all the object detection models is the same. The face detection documentation says:

If you would like to start training from pre-trained weights use --load-weights parameter instead of --resume-from. Also you can use parameters such as --epochs, --batch-size, --gpu-num, --base-learning-rate, otherwise default values will be loaded from ${MODEL_TEMPLATE}.

The difference between --load-weights and --resume-from is that in the first case only the model weights are loaded, while in the second case both the model and the optimizer state are loaded. So this is probably what you are looking for.
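To make the distinction concrete, here is a toy sketch in plain Python (no PyTorch involved). The checkpoint layout with `state_dict`, `optimizer` and `meta` keys is an assumption modelled on typical mmcv-style checkpoints, the numeric values are illustrative, and the helper names `load_weights` and `resume_from` are hypothetical. It only shows which parts of a checkpoint each mode would carry over, not how the training scripts implement it.

```python
import copy

# Toy checkpoint; the layout and the values are assumptions for illustration.
checkpoint = {
    "state_dict": {"w": 0.8},  # model weights
    "optimizer": {"lr": 0.049, "momentum_buffer": {"w": -0.02}},
    "meta": {"epoch": 90, "iter": 133110},
}


def load_weights(ckpt, fresh_lr):
    """Mimic --load-weights: keep only the weights, start everything else anew."""
    return {
        "state_dict": copy.deepcopy(ckpt["state_dict"]),
        "optimizer": {"lr": fresh_lr, "momentum_buffer": {}},
        "meta": {"epoch": 0, "iter": 0},
    }


def resume_from(ckpt):
    """Mimic --resume-from: restore weights, optimizer state and progress counters."""
    return copy.deepcopy(ckpt)


finetune = load_weights(checkpoint, fresh_lr=0.005)
resumed = resume_from(checkpoint)

print(finetune["meta"]["epoch"], finetune["optimizer"])
print(resumed["meta"]["epoch"], resumed["optimizer"])
```

In this sketch both runs start from the same weights, but only the resumed one keeps the learning rate, the momentum buffer and the epoch counter stored in the checkpoint.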

Speaking about this:

The main question is that when I start training from the best-saved model the training kinda goes randomly (just oscillates) and then goes down.

I guess you load the weights but the optimizer is initialized from scratch, so it takes some time to collect statistics such as running means.

How do you run finetuning? With --load-weights or with --resume-from?

WingRS commented 3 years ago

I've run it with --resume-from and it seems to be better, but it doesn't achieve the same mAP score. Also, I have found out that my server config was wrong: the OBJ_DET_DIR was pointing to the wrong directory. But it still was working, can this influence the results?

morkovka1337 commented 3 years ago

Could you please attach an updated screenshot of the training log? About

the OBJ_DET_DIR was pointing to the wrong directory. But it still was working, can this influence the results?

If you are training and finetuning the model on the same dataset, this should work OK. If you train on one dataset and finetune on another, it is in general not guaranteed that the mAP will be higher.

If I understand correctly, you train the model on one specific dataset and the training goes well enough. Then you use the same dataset to finetune the same model, but the mAP goes down and, moreover, the loss at the start of finetuning is about 2, whereas at the end of the first training it was about 1 (judging from the screenshots)?

WingRS commented 3 years ago

Yeah, you understood it correctly. I've changed the obj det dir to the correct one, run it again and it went even worse. The training log:

2021-09-02 10:45:11,103 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.10 (default, Jun  2 2021, 10:49:15) [GCC 9.4.0]
CUDA available: True
GPU 0: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.4.r11.4/compiler.30033411_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.9.1+cu111
OpenCV: 4.5.3
MMCV: 1.3.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.4
MMDetection: 2.9.0+
MMDetection Compiler: GCC 9.3
MMDetection CUDA Compiler: 11.4
NNCF: 1.7.0
ONNX: 1.10.1
ONNXRuntime: None
OpenVINO MO: None
OpenVINO IE: None
------------------------------------------------------------

2021-09-02 10:45:11,301 - mmdet - INFO - Distributed training: True
2021-09-02 10:45:11,485 - mmdet - INFO - Config:
input_size = 384
image_width = 384
image_height = 384
width_mult = 1.0
model = dict(
    type='SingleStageDetector',
    backbone=dict(
        type='mobilenetv2_w1',
        out_indices=(4, 5),
        frozen_stages=-1,
        norm_eval=False,
        pretrained=True),
    neck=None,
    bbox_head=dict(
        type='SSDHead',
        num_classes=9,
        in_channels=(96, 320),
        anchor_generator=dict(
            type='SSDAnchorGeneratorClustered',
            strides=(16, 32),
            widths=[[
                17.665686318905415, 40.73450634200414, 117.61498788545624,
                64.34307112517054
            ],
                    [
                        94.72263671830719, 173.1972153968913,
                        320.2371854303864, 207.336830535971, 352.20547313334856
                    ]],
            heights=[[
                22.150579702734092, 68.24921767068975, 68.97260088862039,
                148.00114686179387
            ],
                     [
                         265.86875666086945, 166.20475919582213,
                         143.7800147372461, 310.2971364875696,
                         330.45387885029794
                     ]]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=(0.0, 0.0, 0.0, 0.0),
            target_stds=(0.1, 0.1, 0.2, 0.2)),
        depthwise_heads=True,
        depthwise_heads_activations='relu',
        loss_balancing=True),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.4,
            neg_iou_thr=0.4,
            min_pos_iou=0.0,
            ignore_iof_thr=-1,
            gt_max_assign_all=False),
        smoothl1_beta=1.0,
        use_giou=False,
        use_focal=False,
        allowed_border=-1,
        pos_weight=-1,
        neg_pos_ratio=3,
        debug=False),
    test_cfg=dict(
        nms=dict(type='nms', iou_threshold=0.45),
        min_bbox_size=0,
        score_thr=0.02,
        max_per_img=200,
        nms_pre_classwise=200))
cudnn_benchmark = True
dataset_type = 'CocoDataset'
img_norm_cfg = dict(mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile', to_float32=True),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.1),
    dict(type='Resize', img_scale=(384, 384), keep_ratio=False),
    dict(type='Normalize', mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(384, 384),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(
                type='Normalize',
                mean=[0, 0, 0],
                std=[255, 255, 255],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=4,
    train=dict(
        type='RepeatDataset',
        times=5,
        dataset=dict(
            type='CocoDataset',
            ann_file='/home/mrs/utils/labels_coco/train.json',
            img_prefix='/home/mrs/Documents/system_dataset',
            pipeline=[
                dict(type='LoadImageFromFile', to_float32=True),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    type='PhotoMetricDistortion',
                    brightness_delta=32,
                    contrast_range=(0.5, 1.5),
                    saturation_range=(0.5, 1.5),
                    hue_delta=18),
                dict(
                    type='MinIoURandomCrop',
                    min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
                    min_crop_size=0.1),
                dict(type='Resize', img_scale=(384, 384), keep_ratio=False),
                dict(
                    type='Normalize',
                    mean=[0, 0, 0],
                    std=[255, 255, 255],
                    to_rgb=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ],
            classes=[
                'rope', 'helmet', 'cellphone', 'backpack', 'survivor', 'cube',
                'vent', 'drill', 'fire_extinguisher'
            ])),
    val=dict(
        type='CocoDataset',
        ann_file='/home/mrs/utils/labels_coco/val.json',
        img_prefix='/home/mrs/Documents/system_dataset',
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(384, 384),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=False),
                    dict(
                        type='Normalize',
                        mean=[0, 0, 0],
                        std=[255, 255, 255],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=[
            'rope', 'helmet', 'cellphone', 'backpack', 'survivor', 'cube',
            'vent', 'drill', 'fire_extinguisher'
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017',
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(384, 384),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=False),
                    dict(
                        type='Normalize',
                        mean=[0, 0, 0],
                        std=[255, 255, 255],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=[
            'rope', 'helmet', 'cellphone', 'backpack', 'survivor', 'cube',
            'vent', 'drill', 'fire_extinguisher'
        ]))
optimizer = dict(type='SGD', lr=0.05, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=1e-05,
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.1)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=10,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
total_epochs = 1000
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/home/mrs/utils/mobilenet_v2_384/resume_90_correct'
load_from = ''
resume_from = '/home/mrs/utils/models_384/big/epoch_90.pth'
workflow = [('train', 1)]
gpu_ids = range(0, 1)

2021-09-02 10:45:13,594 - mmdet - INFO - load checkpoint from /home/mrs/utils/models_384/big/epoch_90.pth
2021-09-02 10:45:13,594 - mmdet - INFO - Use load_from_local loader
2021-09-02 10:45:13,641 - mmdet - INFO - resumed epoch 90, iter 133110
2021-09-02 10:45:13,641 - mmdet - INFO - Start running, host: mrs@gigabedna-focal, work_dir: /home/mrs/utils/mobilenet_v2_384/resume_90_correct
2021-09-02 10:45:13,641 - mmdet - INFO - workflow: [('train', 1)], max: 1000 epochs
2021-09-02 10:45:27,182 - mmdet - INFO - Epoch [91][1/1479] lr: 4.901e-02, eta: 2105 days, 22:56:47, time: 13.519, data_time: 11.897, memory: 10999, loss_cls: 1.1630, loss_bbox: 0.6857, loss: 1.8487

morkovka1337 commented 3 years ago

run it again and it went even worse

From what I see, after the first iteration the loss is at least the same as it was at the end of the main training. How do you conclude that it got worse? Could you please attach the log of the loss and validation mAP after several epochs, or a TensorBoard screenshot? Also, a model is usually finetuned with a lower LR. If you use the same learning rate for finetuning, it is not guaranteed that the model's loss will be lower.
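As a rough illustration of finetuning with a lower LR, the fragment below shows how the relevant keys of the config printed in the log above might be overridden. It is a sketch under assumptions: the learning rate value is illustrative, not a recommendation, and it assumes the training tools honour `load_from` and `resume_from` the way plain mmdetection does (`load_from` loads weights only, `resume_from` also restores the optimizer state and the epoch counter).

```python
# Hypothetical finetuning overrides for the config shown in the log above.
# Assumption: these keys are honoured the way plain mmdetection honours them.
optimizer = dict(
    type='SGD',
    lr=0.005,  # illustrative value, lower than the original 0.05
    momentum=0.9,
    weight_decay=0.0005)
load_from = '/home/mrs/utils/models_384/big/epoch_90.pth'  # weights only
resume_from = None  # do not restore the optimizer state or the epoch counter
```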

WingRS commented 3 years ago

I'd assume that the mAP at the start would be roughly the same as for the last epochs of the previous model, am I mistaken?

morkovka1337 commented 3 years ago

I'd assume that the mAP at the start would be roughly the same as for the last epochs of the previous model, am I mistaken?

Yes, it should be. As I see from the first screenshot, at the end of the training the mAP was about 0.45. What is the mAP at the beginning of the finetuning (with the fixed directory)? I cannot see any mention of the mAP in the log you provided.

Also, I see from the log that you use CUDA 11.1, which in general is not compatible with our mmdetection; we use 10.2. I saw one of your previous issues, which looks similar to this one. Could you please do the step from the last issue:

run python external/mmdetection/mmdet/utils/collect_env.py and share the output?

Do you use our version of the mmdetection or mmcv one?

WingRS commented 3 years ago

Output of the command using the detection env

Traceback (most recent call last):
  File "external/mmdetection/mmdet/utils/collect_env.py", line 6, in <module>
    import mmdet
ModuleNotFoundError: No module named 'mmdet'

The installation seems to be successful, and the first training was fine, so I assumed it would work with my cuda. The output of the tensorboard with fixed directory

WingRS commented 3 years ago

Also output of the command you provided with the venv in source directory:

Traceback (most recent call last):
  File "external/mmdetection/mmdet/utils/collect_env.py", line 3, in <module>
    from mmcv.utils import collect_env as collect_base_env
ImportError: cannot import name 'collect_env' from 'mmcv.utils' (/home/mrs/utils/training_extensions/venv/lib/python3.8/site-packages/mmcv/utils/__init__.py)

morkovka1337 commented 3 years ago

I've just cloned this repository, created a new environment and installed all the packages from scratch using this file. collect_env.py works fine for me, but note that I have CUDA 10.2 installed. You can try to install mmcv-full==1.3.3 instead of mmcv, or downgrade CUDA to 10.2 and reinstall the environment (though I'm not sure your RTX 3090 supports 10.2).

morkovka1337 commented 3 years ago

Also output of the command you provided with the venv in source directory:

I did not get it: do you have two environments, or did you run collect_env.py the first time without installing the packages from init_venv.sh?

WingRS commented 3 years ago

I do have mmcv, 1.3.0, but not mmdet, that's the issue from what I see. Yes, so the first venv is from the README in the root directory of training_extensions, and the second one in the models/object_detection

morkovka1337 commented 3 years ago

mmdet should be installed when running init_venv.sh from models/object_detection. The script in models/object_detection is self-sufficient.

Could you, please, share the output of the pip list?

WingRS commented 3 years ago

Here is the pip list

Package                         Version     Location
------------------------------- ----------- ---------------------------------------
absl-py                         0.13.0
actionlib                       1.13.2
addict                          2.4.0
angles                          1.9.13
astroid                         2.5.8
attrs                           21.2.0
base_local_planner              1.17.1
bondpy                          1.8.6
cachetools                      4.2.2
camera-calibration              1.15.3
camera-calibration-parsers      1.12.0
catkin                          0.8.9
certifi                         2021.5.30
charset-normalizer              2.0.4
controller-manager              0.19.4
controller-manager-msgs         0.19.4
cv-bridge                       1.15.0
cycler                          0.10.0
Cython                          0.29.24
defusedxml                      0.7.1
diagnostic-analysis             1.10.3
diagnostic-common-diagnostics   1.10.3
diagnostic-updater              1.10.3
dynamic-reconfigure             1.7.1
editdistance                    0.5.3
gazebo_plugins                  2.9.1
gazebo_ros                      2.9.1
gencpp                          0.6.5
geneus                          3.0.0
genlisp                         0.4.18
genmsg                          0.5.16
gennodejs                       2.0.2
genpy                           0.6.14
google-auth                     1.34.0
google-auth-oauthlib            0.4.5
graphviz                        0.17
grpcio                          1.39.0
idna                            3.2
image-geometry                  1.15.0
iniconfig                       1.1.1
interactive-markers             1.12.0
isort                           5.9.3
joblib                          1.0.1
joint_state_publisher           1.15.0
joint_state_publisher_gui       1.15.0
jsonschema                      3.2.0
jstyleson                       0.0.2
kiwisolver                      1.3.1
laser_geometry                  1.6.7
lazy-object-proxy               1.6.0
Markdown                        3.3.4
matplotlib                      3.4.3
mccabe                          0.6.1
message-filters                 1.15.9
mmcv-full                       1.3.0
mmpycocotools                   12.0.3
natsort                         7.1.1
networkx                        2.6.2
ninja                           1.10.2
nncf                            1.7.0
numpy                           1.19.5
oauthlib                        3.1.1
onnx                            1.10.1
opencv-python                   4.5.3.56
ote                             0.2         /home/mrs/utils/training_extensions/ote
packaging                       21.0
pandas                          1.3.1
Pillow                          8.3.1
pip                             21.2.4
pkg_resources                   0.0.0
pluggy                          0.13.1
Polygon3                        3.0.8
protobuf                        3.17.3
py                              1.10.0
py-trees                        0.7.6
pyasn1                          0.4.8
pyasn1-modules                  0.2.8
pydot                           1.4.2
pylint                          2.7.2
pyparsing                       2.4.7
pyrsistent                      0.18.0
pytest                          6.2.4
python-dateutil                 2.8.2
python-qt-binding               0.4.3
pytorchcv                       0.0.66
pytz                            2021.1
PyYAML                          5.4.1
qt-dotgraph                     0.4.2
qt-gui                          0.4.2
qt-gui-cpp                      0.4.2
qt-gui-py-common                0.4.2
requests                        2.26.0
requests-oauthlib               1.3.0
resource_retriever              1.12.6
ros_numpy                       0.0.4
rosapi                          0.11.13
rosbag                          1.15.9
rosboost-cfg                    1.15.7
rosbridge_library               0.11.13
rosbridge_server                0.11.13
rosclean                        1.15.7
roscreate                       1.15.7
rosdoc_lite                     0.2.10
rosgraph                        1.15.9
roslaunch                       1.15.9
roslib                          1.15.7
roslint                         0.12.0
roslz4                          1.15.9
rosmake                         1.15.7
rosmaster                       1.15.9
rosmsg                          1.15.9
rosnode                         1.15.9
rosparam                        1.15.9
rospy                           1.15.9
rosservice                      1.15.9
rostest                         1.15.9
rostopic                        1.15.9
rosunit                         1.15.7
roswtf                          1.15.9
rqt_action                      0.4.9
rqt_bag                         0.5.1
rqt_bag_plugins                 0.5.1
rqt_console                     0.4.11
rqt-controller-manager          0.19.4
rqt_dep                         0.4.10
rqt-ez-publisher                0.6.1
rqt_graph                       0.4.14
rqt_gui                         0.5.2
rqt_gui_py                      0.5.2
rqt_image_view                  0.4.16
rqt_joint_trajectory_controller 0.18.1
rqt_launch                      0.4.9
rqt_logger_level                0.4.11
rqt-moveit                      0.5.9
rqt_msg                         0.4.9
rqt-multiplot                   0.0.12
rqt_nav_view                    0.5.7
rqt_plot                        0.4.13
rqt_pose_view                   0.5.10
rqt_publisher                   0.4.9
rqt_py_common                   0.5.2
rqt_py_console                  0.4.9
rqt_py_trees                    0.4.0
rqt-reconfigure                 0.5.3
rqt-robot-dashboard             0.5.8
rqt-robot-monitor               0.5.13
rqt_robot_steering              0.5.12
rqt_runtime_monitor             0.5.8
rqt-rviz                        0.6.1
rqt_service_caller              0.4.9
rqt_shell                       0.4.10
rqt_srv                         0.4.8
rqt_tf_tree                     0.6.2
rqt_top                         0.4.9
rqt_topic                       0.4.12
rqt_web                         0.4.9
rsa                             4.7.2
rviz                            1.14.5
scikit-learn                    0.24.2
scipy                           1.7.1
sensor-msgs                     1.13.1
setuptools                      44.0.0
six                             1.16.0
smach                           2.5.0
smach-ros                       2.5.0
smach_viewer                    3.0.1
smclib                          1.8.6
subt_comms_test                 0.1.0
tensorboard                     2.6.0
tensorboard-data-server         0.6.1
tensorboard-plugin-wit          1.8.0
terminaltables                  3.1.0
test-generator                  0.1.1
texttable                       1.6.4
tf                              1.13.2
tf-conversions                  1.13.2
tf2-geometry-msgs               0.7.5
tf2-kdl                         0.7.5
tf2-py                          0.7.5
tf2-ros                         0.7.5
tf2-sensor-msgs                 0.7.5
threadpoolctl                   2.2.0
toml                            0.10.2
topic-tools                     1.15.9
torch                           1.8.1+cu111
torchvision                     0.9.1+cu111
tqdm                            4.62.0
typing-extensions               3.10.0.0
unique_id                       1.0.6
urllib3                         1.26.6
Werkzeug                        2.0.1
wheel                           0.37.0
wrapt                           1.12.1
xacro                           1.14.6
yapf                            0.31.0

morkovka1337 commented 3 years ago

OK, you can try to install mmdetection manually. In models/object_detection run: pip install -e ../../external/mmdetection/ -c constraints.txt. Or you can initialize another environment using the init_venv.sh script, and I insist on this option. It guarantees that a clean environment is created solely for object detection and that there are no package conflicts (explicit or implicit).

WingRS commented 3 years ago

But can this really make the evaluation lower, since it's working and giving some results? Between the first training (where it achieved around 40% mAP) and the last one, the only thing that changed is moving the model from /tmp/my_model to a local directory in /home/.

morkovka1337 commented 3 years ago

In the ideal situation everything should work as expected, but there are strange things here: mmdetection is not installed as a package (as we see from the pip list and collect_env.py), yet the training works. In this case I'm just not sure how the training (and, consequently, finetuning) works.
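One way to look into this is to ask Python directly where a module would be imported from and whether a matching distribution is registered with pip. The sketch below uses only the standard library (Python 3.8+); the helper name `describe` is made up for this example. A module can be importable because its directory happens to be on the import path even when no pip distribution is installed, which is one possible (unconfirmed) explanation for training working without mmdet in the pip list.

```python
import importlib.util
from importlib import metadata


def describe(module_name, dist_name=None):
    """Return (import location or None, installed distribution version or None)."""
    spec = importlib.util.find_spec(module_name)
    location = spec.origin if spec is not None else None
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = None
    return location, version


# The distribution names are the ones that appear in the pip list output in this thread.
for module_name, dist_name in [("mmdet", "mmdet"), ("mmcv", "mmcv-full")]:
    print(module_name, describe(module_name, dist_name))
```

If the first element is a path while the second is None, the module is being picked up from the import path and not from an installed package.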

Ok, could you try to run the evaluation of your trained model, using separate script? Does it reproduce the same mAP values on the same val set?

WingRS commented 3 years ago

I'll do it on friday, and in the meantime I am uploading data to cluster which has cuda 10.2, let's see if the problem will be the same there. Thx for the help, I'll write about it tomorrow.

WingRS commented 3 years ago

Hi, so I've found another issue. When installing training_extensions on a new machine, part of init_venv.sh fails. In particular, git submodule init ../../external returned an error in my case, and I needed to call it myself from the root.

morkovka1337 commented 3 years ago

OK, watching for the further results!

WingRS commented 3 years ago

Also, I inited the training with lower LR (took a look at the last epoch in best training) with 0.002, but the TensorBoard shows that the training began with higher LR. The log:

optimizer = dict(type='SGD', lr=0.0002, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=1e-05,
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.1)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=10,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
total_epochs = 1000
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/home/mrs/utils/mobilenet_v2_384/resume_90_correct_lower_lr'
load_from = ''
resume_from = '/home/mrs/utils/epoch_90.pth'
workflow = [('train', 1)]
gpu_ids = range(0, 1)

2021-09-03 07:18:31,018 - mmdet - INFO - load checkpoint from /home/mrs/utils/epoch_90.pth
2021-09-03 07:18:31,018 - mmdet - INFO - Use load_from_local loader
2021-09-03 07:18:31,063 - mmdet - INFO - resumed epoch 90, iter 133110
2021-09-03 07:18:31,063 - mmdet - INFO - Start running, host: mrs@gigabedna-focal, work_dir: /home/mrs/utils/mobilenet_v2_384/resume_90_correct_lower_lr
2021-09-03 07:18:31,063 - mmdet - INFO - workflow: [('train', 1)], max: 1000 epochs
2021-09-03 07:18:44,815 - mmdet - INFO - Epoch [91][1/1479] lr: 4.901e-02, eta: 2138 days, 12:17:58, time: 13.728, data_time: 12.052, memory: 10999, loss_cls: 0.8792, loss_bbox: 0.5224, loss: 1.4016
2021-09-03 07:18:49,317 - mmdet - INFO - Epoch [91][10/1479]    lr: 4.901e-02, eta: 242 days, 3:20:20, time: 1.816, data_time: 1.240, memory: 10999, loss_cls: 1.3665, loss_bbox: 0.8127, loss: 2.1792
2021-09-03 07:18:55,506 - mmdet - INFO - Epoch [91][20/1479]    lr: 4.901e-02, eta: 125 days, 21:20:29, time: 0.619, data_time: 0.161, memory: 10999, loss_cls: 1.6301, loss_bbox: 2.1397, loss: 3.7698

And the tensorboard screenshot:

WingRS commented 3 years ago

The eval.py also ends with an error, but it confirms that the previously trained model has the mAP shown on the TensorBoard screenshot. The log:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.430
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.748
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.439
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.471
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.528
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.528
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.528
OrderedDict([('bbox_mAP', 0.43), ('bbox_mAP_50', 0.748), ('bbox_mAP_75', 0.439), ('bbox_mAP_s', -1.0), ('bbox_mAP_m', -1.0), ('bbox_mAP_l', 0.471), ('bbox_mAP_copypaste', '0.430 0.748 0.439 -1.000 -1.000 0.471')])
Traceback (most recent call last):
  File "eval.py", line 46, in <module>
    main()
  File "eval.py", line 42, in main
    evaluator(**eval_args)
  File "/home/mrs/utils/training_extensions/ote/ote/modules/evaluators/base.py", line 38, in __call__
    self._evaluate_internal(config, snapshot, out, update_config, metrics_functions, **kwargs)
  File "/home/mrs/utils/training_extensions/ote/ote/modules/evaluators/base.py", line 75, in _evaluate_internal
    with open(out, 'w') as write_file:
PermissionError: [Errno 13] Permission denied: '/metrics.yaml'

WingRS commented 3 years ago

Also, a question - does it output mAP score for each class?

WingRS commented 3 years ago

Also, when installing on a fresh machine, got this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
ote 0.2 requires mmcv==1.3.9, which is not installed.

Though command pip list | grep mm returns:

mmcv-full               1.3.0
mmdet                   2.9.0       /home/stasiyur/training_extensions/external/mmdetection
mmlvis                  10.5.3
mmpycocotools           12.0.3

And the ote is installed:

ote                     0.2         /home/stasiyur/training_extensions/ote

And nothing in the init_venv.sh file was changed. I am trying to investigate the issue so far.

morkovka1337 commented 3 years ago

I inited the training with lower LR (took a look at the last epoch in best training) with 0.002, but the TensorBoard shows that the training began with higher LR.

I see, this is strange. I need more time to figure out what causes this. Do you run the training in the same environment (with non-working export of mmdet)?
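The logged value can at least be checked against the schedule in the config. The sketch below is plain Python and assumes the cosine annealing schedule is evaluated once per epoch with progress = epoch / total_epochs; the function name `cosine_annealing_lr` is made up for this example. The inputs (min_lr=1e-05, total_epochs=1000, resumed epoch 90, base LR 0.05 and 0.0002) are taken from the configs and logs in this thread.

```python
import math


def cosine_annealing_lr(base_lr, min_lr, epoch, total_epochs):
    """Cosine annealing evaluated once per epoch (an assumption about the schedule)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Base LR of the original run vs. base LR written in the new config.
lr_old_base = cosine_annealing_lr(0.05, 1e-05, epoch=90, total_epochs=1000)
lr_new_base = cosine_annealing_lr(0.0002, 1e-05, epoch=90, total_epochs=1000)

print(f"{lr_old_base:.3e}")  # 4.901e-02, the value shown in the log
print(f"{lr_new_base:.3e}")  # below 2e-04
```

Under these assumptions, the logged 4.901e-02 is what the old base LR of 0.05 yields at epoch 90. That is consistent with the base LR being restored from the checkpoint's optimizer state when resuming, and not being read from the new config, although it does not prove it.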

The eval.py also ends with an error, but it confirms that the previously trained model has the mAP shown on the TensorBoard screenshot.

Ok, this gives us certainty the model is saved and loaded correctly.

Also, a question - does it output mAP score for each class?

If you mean AP (because mAP is mean AP, averaged over all classes), you have to add --options "classwise=True" when using the test script or add classwise=True in the evaluation config for training-time evaluation.

See details here.
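For training-time evaluation, the setting might look like the following fragment of an mmdetection-style config. This is only a sketch, not the template shipped with the repo: the `interval` and `metric` values are assumptions, and `classwise` only has an effect for dataset classes whose evaluation routine accepts it.

```python
# Hypothetical evaluation section of an mmdetection-style model config.
# `interval` and `metric` are placeholder values chosen for illustration;
# only `classwise=True` comes from the advice in this thread.
evaluation = dict(
    interval=1,       # assumed: run evaluation after every epoch
    metric='bbox',    # assumed: COCO-style bounding-box metric
    classwise=True)   # request per-class AP in addition to the mean
```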

Also, when installing on a fresh machine, got this error:

You can ignore this error or install mmcv==1.3.9 manually

WingRS commented 3 years ago

If you mean AP (because mAP is mean AP, averaged over all classes), you have to add --options "classwise=True" when using the test script or add classwise=True in the evaluation config for training-time evaluation.

The options argument doesn't work and is not recognized.

WingRS commented 3 years ago

I see, this is strange. Need more time to figure out, what does cause this. Do you run the training in the same environment (with non-working export of mmdet)?

Yes, it is run on that machine

WingRS commented 3 years ago

So, I have made a clean installation on the current server (with RTX 3090, since the cluster has a problem with torch compilation), and the collect env log looks like this:

sys.platform: linux
Python: 3.8.10 (default, Jun  2 2021, 10:49:15) [GCC 9.4.0]
CUDA available: True
GPU 0: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.4.r11.4/compiler.30033411_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.1+cu111
OpenCV: 4.5.1-openvino
MMCV: 1.3.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.4
MMDetection: 2.9.0+4856591
MMDetection Compiler: GCC 9.3
MMDetection CUDA Compiler: 11.4
NNCF: 1.7.0
ONNX: 1.10.1
ONNXRuntime: 1.8.1
OpenVINO MO: 2021.2.0-1877-176bdf51370-releases/2021/2
OpenVINO IE: 2.1.2021.2.0-1877-176bdf51370-releases/2021/2

Everything seems to be working, besides initializing from a pre-trained model. The log seems to state:

2021-09-05 16:51:27,941 - mmdet - INFO - workflow: [('train', 1)], max: 1000 epochs
INFO:mmdet:workflow: [('train', 1)], max: 1000 epochs
/home/mrs/deep_learning/training_extensions/models/object_detection/det_venv/lib/python3.8/site-packages/mmcv/runner/hooks/logger/text.py:55: DeprecationWarning: an integer is required (got type float).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
  mem_mb = torch.tensor([mem / (1024 * 1024)],
2021-09-05 16:51:41,867 - mmdet - INFO - Epoch [91][1/1479]     lr: 4.901e-02, eta: 2165 days, 19:10:33, time: 13.903, data_time: 12.249, memory: 10999, loss_cls: 0.8771, loss_bbox: 0.4694, loss: 1.3466
INFO:mmdet:Epoch [91][1/1479]   lr: 4.901e-02, eta: 2165 days, 19:10:33, time: 13.903, data_time: 12.249, memory: 10999, loss_cls: 0.8771, loss_bbox: 0.4694, loss: 1.3466
2021-09-05 16:51:41,879 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
INFO:mmcv:Reducer buckets have been rebuilt in this iteration.

I'll leave it running to see whether the initial experiment reproduces or there is some deeper bug.

WingRS commented 3 years ago

I might have a clue about what is going on: the learning rate is not going down as it did before. This is the current experiment, with the latest version of training_extensions and all the packages:

And the best-trained example:

As you can see, the learning rate doesn't decrease as fast as it was last time. The tensorboard also shows that the loss is oscillating a lot, which might be the result of the learning rate not decaying.

morkovka1337 commented 3 years ago

If you mean AP (because mAP is mean AP, averaged over all classes), you have to add --options "classwise=True" when using the test script or add classwise=True in the evaluation config for training-time evaluation.

The options argument doesn't work and is not recognized.

Sorry, I meant, add classwise=True in the test part of the config.

As you can see, the learning rate doesn't decrease as fast as it was last time

As I saw earlier, you have set the number of epochs to 1000. The LR policy is Cosine Annealing, this means, it will start from the initial value and will be reduced to the min_lr on the cosine law (as the standard cos function acts in the [0, pi] interval). If I interpret this correctly, the LR is calculated using the following rule: lr = min_lr + 0.5 * (start_lr - min_lr) * [cos(pi * cur_epoch / num_epochs) + 1], where cur_epoch is the current epoch and num_epochs is the total number of epochs. With 1000 epochs the cosine argument grows very slowly, so the LR stays close to its initial value for a long time.
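To make the effect of the epoch count concrete, here is a minimal sketch of the usual cosine-annealing form, in which the cosine argument is pi times the fraction of training completed. The function name is made up for illustration, and the starting LR of 0.05 is an assumption rather than a value confirmed in the thread; min_lr is set to 1e-5 purely as an example floor.

```python
import math


def cosine_annealing_lr(base_lr, min_lr, epoch, max_epochs):
    """LR at a 0-based `epoch` when annealing from base_lr down to min_lr."""
    progress = epoch / max_epochs  # fraction of training completed, in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (math.cos(math.pi * progress) + 1)


BASE_LR = 0.05  # assumed starting LR (hypothetical)
MIN_LR = 1e-5   # example floor for the schedule

# Compare the LR at the same epoch under a short and a long schedule.
for max_epochs in (100, 1000):
    lr = cosine_annealing_lr(BASE_LR, MIN_LR, epoch=90, max_epochs=max_epochs)
    print(f"max_epochs={max_epochs:4d}  lr at epoch 90 = {lr:.5f}")
```

Under these assumed values, the 1000-epoch schedule is still at roughly 0.049 after 90 epochs, while the 100-epoch schedule has already dropped by more than an order of magnitude.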

WingRS commented 3 years ago

The LR policy is Cosine Annealing, this means, it will start from the initial value and will be reduced to the min_lr on the cosine law

Oh, I see, I've just taken a look at the scheduler. Okay, if I start fine-tuning the model and set the number of epochs to 20, for example (resuming from 100 and adding 20 more), then the question arises: what should the starting LR be for fine-tuning?

morkovka1337 commented 3 years ago

I've investigated the problem, and here are the results:

If you use --load-weights, the LR will be updated if it is changed via the --base-learning-rate command line option or in the template.yaml (the lr in the model.py is actually ignored).
If you use --resume-from, the LR is loaded from the checkpoint and cannot be changed.

I've also checked the finetuning of a person-detection model. In my case there were no significant differences in the loss values during the first several iterations, regardless of how the model was loaded (via --load-weights or --resume-from).

Thus, in order to finetune the model use --load-weights combined with --base-learning-rate. Adapt the LR policy and warmup in the model.py if needed. As I can see from the log you provided a couple of messages earlier, the loss_cls and loss_bbox are near the loss values at the end of the first training run (0.3 & 0.8, respectively).

WingRS commented 3 years ago

In the meantime, I've tested out training using --resume-from and base-learning-rate from previous training, and set 20 epochs. The results seem to be promising:

WingRS commented 3 years ago

2. Thus, in order to finetune the model use --load-weights combined with --base-learning-rate. Adapt the LR policy and warmup in the model.py if needed.

Thanks for your answer, I appreciate your effort! I'll try out this experiment, though I am having a hard time figuring out what the number of epochs should be. If I set the number too high, the previous issues may arise.

morkovka1337 commented 3 years ago

If you use --load-weights, set the exact number of epochs you want to finetune the model for. It will be treated as a new experiment; the model will just be loaded from the checkpoint instead of being initialized from scratch. =)

WingRS commented 3 years ago

Okay, so I've started two experiments:

one with --resume-from - up to 200 epochs (since it has proven to train nicely for 20 epochs, I want to try it out)
another one with --load-weights for another 100 epochs

So far, I've seen that if I use --load-weights the classification loss goes to about 2x what it was in the last epoch of the initial training, and the learning rate jumps, even though I've used the --base-learning-rate argument.

morkovka1337 commented 3 years ago

the learning rate jumps

What do you mean by "jumps"? What LR policy and warmup do you use?

WingRS commented 3 years ago

What do you mean by "jumps"? What LR policy and warmup do you use?

The built-in one in the templates:

lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0.00001,
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.1)


The "jumps" were a misunderstanding on my part. However, the metrics go down and then slowly build back up, as in the screenshot. This is using --load-weights with 100 additional epochs, 48 epochs into training so far.
![image](https://user-images.githubusercontent.com/13340448/132300554-8be9e2ed-9fd4-4801-9b0a-399ab5f88031.png)
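The warmup part of the lr_config above can be sketched as follows. This is an illustrative reimplementation of a linear warmup, not code taken from the training scripts: the function name is hypothetical, and the regular LR of 0.002 is only an example value. With warmup_ratio=0.1 and warmup_iters=100, the LR starts at 10% of the scheduled value and ramps up linearly over the first 100 iterations, which can look like a jump when a new run starts from a finished one.

```python
def linear_warmup_lr(regular_lr, cur_iter, warmup_iters, warmup_ratio):
    """LR during linear warmup; falls back to regular_lr once warmup is over."""
    if cur_iter >= warmup_iters:
        return regular_lr
    # k shrinks linearly from (1 - warmup_ratio) at iteration 0 down to 0.
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return regular_lr * (1 - k)


REGULAR_LR = 0.002  # example value only; the scheduled LR depends on the run

for it in (0, 25, 50, 75, 100):
    lr = linear_warmup_lr(REGULAR_LR, it, warmup_iters=100, warmup_ratio=0.1)
    print(f"iter {it:3d}: lr = {lr:.6f}")
```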

WingRS commented 3 years ago

Using the step scheduler, that is recommended in the mmdetection documentation for fine-tuning, I've been able to achieve this, but it's only 35 epochs so far. Though the learning rate in my opinion is still high, and there is no way to change it. Maybe you have a suggestion for the learning scheduler?

lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.0001,
    step=[7])

morkovka1337 commented 3 years ago

As I can see from the screenshot, you finetune with --resume-from. I can only suggest using --load-weights instead, which will allow you to reduce the LR. You can also add LR drops: step=[7, 14, 21, 28]. This will drop the LR by 10x at the 7th, 14th, 21st and 28th epochs.
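A minimal sketch of how such a step schedule behaves, assuming a decay factor of 0.1 per milestone (the 10x drop mentioned above) and 0-based epoch counting. The function name and the base LR are illustrative, not taken from the training scripts.

```python
from bisect import bisect_right


def step_lr(base_lr, epoch, steps, gamma=0.1):
    """LR under a step policy: multiply by gamma once per milestone reached."""
    # Number of milestones that are <= the current epoch.
    drops = bisect_right(sorted(steps), epoch)
    return base_lr * gamma ** drops


BASE_LR = 0.01  # hypothetical value passed via --base-learning-rate
STEPS = [7, 14, 21, 28]

for epoch in (0, 6, 7, 14, 21, 28):
    print(f"epoch {epoch:2d}: lr = {step_lr(BASE_LR, epoch, STEPS):.1e}")
```

With four milestones the LR ends up four orders of magnitude below the base value, so the base LR and the number of milestones are worth choosing together.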

WingRS commented 3 years ago

Thanks for the advice. I've read about the cyclic learning rate scheduler and started one experiment with it and another one with the step scheduler as you suggested. Let's see if it results in some improvement.

WingRS commented 3 years ago

I've also spotted some inconsistency between .log and .log.json in the values:

.log

2021-09-08 18:10:01,359 - mmdet - INFO - Epoch [23][70/1682]    lr: 1.652e-05, eta: 2 days, 17:49:40, time: 0.782, data_time: 0.305, memory: 10992, loss_cls: 0.8845, loss_bbox: 0.3341, loss: 1.2186
2021-09-08 18:10:07,434 - mmdet - INFO - Epoch [23][80/1682]    lr: 1.652e-05, eta: 2 days, 17:49:17, time: 0.607, data_time: 0.142, memory: 10992, loss_cls: 0.8879, loss_bbox: 0.3360, loss: 1.2238
2021-09-08 18:10:15,905 - mmdet - INFO - Epoch [23][90/1682]    lr: 1.651e-05, eta: 2 days, 17:49:13, time: 0.847, data_time: 0.394, memory: 10992, loss_cls: 0.8894, loss_bbox: 0.3615, loss: 1.2508
2021-09-08 18:10:22,506 - mmdet - INFO - Epoch [23][100/1682]   lr: 1.651e-05, eta: 2 days, 17:48:55, time: 0.660, data_time: 0.191, memory: 10992, loss_cls: 0.9045, loss_bbox: 0.3421, loss: 1.2466
2021-09-08 18:10:30,199 - mmdet - INFO - Epoch [23][110/1682]   lr: 1.651e-05, eta: 2 days, 17:48:45, time: 0.769, data_time: 0.312, memory: 10992, loss_cls: 0.8511, loss_bbox: 0.2994, loss: 1.1506
2021-09-08 18:10:36,981 - mmdet - INFO - Epoch [23][120/1682]   lr: 1.651e-05, eta: 2 days, 17:48:28, time: 0.678, data_time: 0.214, memory: 10992, loss_cls: 0.8732, loss_bbox: 0.3309, loss: 1.2041
2021-09-08 18:10:45,242 - mmdet - INFO - Epoch [23][130/1682]   lr: 1.651e-05, eta: 2 days, 17:48:23, time: 0.826, data_time: 0.337, memory: 10992, loss_cls: 0.8140, loss_bbox: 0.3023, loss: 1.1163
2021-09-08 18:10:51,895 - mmdet - INFO - Epoch [23][140/1682]   lr: 1.651e-05, eta: 2 days, 17:48:05, time: 0.665, data_time: 0.194, memory: 10992, loss_cls: 0.8054, loss_bbox: 0.2746, loss: 1.0800
2021-09-08 18:11:01,365 - mmdet - INFO - Epoch [23][150/1682]   lr: 1.650e-05, eta: 2 days, 17:48:09, time: 0.947, data_time: 0.476, memory: 10992, loss_cls: 0.8746, loss_bbox: 0.3318, loss: 1.2064

.json

{"mode": "train", "epoch": 23, "iter": 70, "lr": 2e-05, "memory": 10992, "data_time": 0.30503, "loss_cls": 0.88451, "loss_bbox": 0.33411, "loss": 1.21862, "time": 0.7816}
{"mode": "train", "epoch": 23, "iter": 80, "lr": 2e-05, "memory": 10992, "data_time": 0.14151, "loss_cls": 0.88786, "loss_bbox": 0.33597, "loss": 1.22383, "time": 0.60734}
{"mode": "train", "epoch": 23, "iter": 90, "lr": 2e-05, "memory": 10992, "data_time": 0.39376, "loss_cls": 0.88938, "loss_bbox": 0.36147, "loss": 1.25085, "time": 0.84687}
{"mode": "train", "epoch": 23, "iter": 100, "lr": 2e-05, "memory": 10992, "data_time": 0.1913, "loss_cls": 0.90445, "loss_bbox": 0.34213, "loss": 1.24658, "time": 0.66038}
{"mode": "train", "epoch": 23, "iter": 110, "lr": 2e-05, "memory": 10992, "data_time": 0.3117, "loss_cls": 0.85114, "loss_bbox": 0.29945, "loss": 1.15059, "time": 0.76913}
{"mode": "train", "epoch": 23, "iter": 120, "lr": 2e-05, "memory": 10992, "data_time": 0.21424, "loss_cls": 0.87322, "loss_bbox": 0.33092, "loss": 1.20414, "time": 0.67828}
{"mode": "train", "epoch": 23, "iter": 130, "lr": 2e-05, "memory": 10992, "data_time": 0.3365, "loss_cls": 0.81401, "loss_bbox": 0.30232, "loss": 1.11633, "time": 0.82565}
{"mode": "train", "epoch": 23, "iter": 140, "lr": 2e-05, "memory": 10992, "data_time": 0.19364, "loss_cls": 0.80539, "loss_bbox": 0.27459, "loss": 1.07999, "time": 0.66529}
{"mode": "train", "epoch": 23, "iter": 150, "lr": 2e-05, "memory": 10992, "data_time": 0.47623, "loss_cls": 0.8746, "loss_bbox": 0.33185, "loss": 1.20644, "time": 0.94724}

Does the learning rate really differ, or am I misinterpreting the data?

morkovka1337 commented 3 years ago

Actually, it does not. The value in the json log (2e-05) is a rounded version of the value in the .log file (1.65e-05). I did not look in detail under the hood, but it seems to be a rounding feature of the logger.
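The two logs are consistent if the JSON writer rounds floats to five decimal places, which is an assumption here but matches the numbers shown. A quick check with the values from the .log excerpt:

```python
# LR values as printed in the text log (three significant digits kept there).
lrs_from_text_log = [1.652e-05, 1.651e-05, 1.650e-05]

for lr in lrs_from_text_log:
    # Rounding to 5 decimal places leaves a single significant digit
    # for values of this magnitude.
    rounded = round(lr, 5)
    print(f"{lr:.3e} -> {rounded}")
```

For learning rates this small, the text log is the more informative of the two.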

WingRS commented 3 years ago

OK, after long training without any significant improvement, I've decided to test the model with the bigger input, the 512 one. It does not even reach the previous model's performance, and it gives this warning during initialization:


size mismatch for bbox_head.cls_convs.0.3.weight: copying a param with shape torch.Size([324, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([40, 96, 1, 1]).
size mismatch for bbox_head.cls_convs.0.3.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([40]).
size mismatch for bbox_head.cls_convs.1.3.weight: copying a param with shape torch.Size([405, 320, 1, 1]) from checkpoint, the shape in current model is torch.Size([50, 320, 1, 1]).
size mismatch for bbox_head.cls_convs.1.3.bias: copying a param with shape torch.Size([405]) from checkpoint, the shape in current model is torch.Size([50]).
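One possible reading of these shapes, offered as an assumption rather than a diagnosis: in an SSD-style head, the classification conv has (anchors per location) x (outputs per anchor) output channels. The sketch below factors the numbers from the warning under that assumption. The anchor counts 4 and 5 are guesses that happen to divide the channel counts evenly; nothing in the thread confirms them.

```python
# Output-channel counts copied from the size-mismatch warning.
shapes = {
    "bbox_head.cls_convs.0.3": {"checkpoint": 324, "model": 40},
    "bbox_head.cls_convs.1.3": {"checkpoint": 405, "model": 50},
}

# Assumed anchors per location for each head level (a guess, see above).
assumed_anchors = {
    "bbox_head.cls_convs.0.3": 4,
    "bbox_head.cls_convs.1.3": 5,
}


def outputs_per_anchor(channels, anchors):
    """Split a channel count into per-anchor outputs; fail if it does not divide."""
    quotient, remainder = divmod(channels, anchors)
    if remainder:
        raise ValueError(f"{channels} channels are not divisible by {anchors} anchors")
    return quotient


for name, sizes in shapes.items():
    anchors = assumed_anchors[name]
    per_anchor = {src: outputs_per_anchor(ch, anchors) for src, ch in sizes.items()}
    print(name, per_anchor)
```

Under this reading the checkpoint head predicts 81 outputs per anchor and the current model 10, which would be consistent with a checkpoint trained on a different label set (9 classes plus a background output gives 10). If that is the case, the warning would mean the classification layers are not taken from the checkpoint, while layers with matching shapes are.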

WingRS commented 3 years ago

Also, when using the pre-trained weights for the 512x512 model input, it goes into overfitting and the mAP gradually goes to zero. When avoiding the pre-trained weights, it trains smoothly: orange is with the pre-trained weights from GitHub and blue is without. The data in both experiments is the same.

morkovka1337 commented 3 years ago

Without details about the learning rate value, policy, and warmup, it is hard to tell what causes this. I can only suggest training from scratch if it suits your task well. ;)

WingRS commented 3 years ago

I've already trained without the initial weights. And it works like a charm. The initial setup was the one provided in the repo. The message was more about a possible problem with the initial weights, since they don't match with the architecture, as the PyTorch says.

And a question, the openvino API performs the non-max suppression by itself in the background? If yes, how this is done during the transition from the PyTorch model to IR format?

morkovka1337 commented 3 years ago

The message was more about a possible problem with the initial weights, since they don't match with the architecture, as the PyTorch says.

Again, I cannot say anything concrete without details about which model and which weights were used. Could you please share the details (the config file if a standard model was used, and which weights you use, a URL or something)?

And a question, the openvino API performs the non-max suppression by itself in the background?

Yes, the NMS is supported.

If yes, how this is done during the transition from the PyTorch model to IR format?

Depends on what you mean. If you mean how to convert the model, then the standard pipeline is used (PyTorch -> ONNX -> OpenVINO IR). The export is performed by this script. If you are interested in the details of the implementation under the hood, then you can refer to the openvino main repo: https://github.com/openvinotoolkit/openvino

manisoftwartist commented 3 years ago

2. If you use --resume-from, the LR is loaded from the checkpoint and cannot be changed.

So do we have a way of inferring this LR from pre-trained snapshot.pth before actually starting the training?
The external/mmdetection/tools/train.py just takes a single parameter resume-from. It does not have a load-weights at all! In this case, how do you process this? I am trying to understand how mmdetection understands the difference between load-weights and resume-from?

morkovka1337 commented 3 years ago

If you use --resume-from, the LR is loaded from the checkpoint and cannot be changed.

So do we have a way of inferring this LR from pre-trained snapshot.pth before actually starting the training?

The external/mmdetection/tools/train.py just takes a single parameter resume-from. It does not have a load-weights at all! In this case, how do you process this? I am trying to understand how mmdetection understands the difference between load-weights and resume-from?

I don't quite understand the question. If you use resume_from, the LR will be loaded from the snapshot and the whole training process will be continued from the very place it stopped before. If you mean how we can check the value of the LR in the snapshot, we can load it using torch's load and look at the dict.
Well, that's true. --load-weights is a parameter of ote. ote is, roughly speaking, a front end of the training process, whereas mmdetection is a back end. This means ote processes parameters and calls mmdetection functions. Under the hood, resume-from calls mmcv's resume function, and load-weights (which is transformed to load-from under the hood of ote) just loads the checkpoint without loading the state of the optimizer: https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/base_runner.py#L332

So the whole process is the following: ote processes model template and passes parameters to the mmdetection, which uses mmcv's runner for the training process. It is complicated, but it is the price we pay for the modularity and flexibility.
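Looking at the LR stored in a snapshot, as mentioned above, could be sketched like this. The checkpoint layout assumed here (top-level keys "meta", "state_dict" and "optimizer", with the optimizer entry being a standard PyTorch optimizer state dict) is an assumption about how the runner saves checkpoints; both function names are made up for illustration.

```python
def optimizer_lrs(checkpoint):
    """Return the LR of every param group stored in a checkpoint dict."""
    optimizer_state = checkpoint.get("optimizer")
    if optimizer_state is None:
        return []  # weights-only checkpoint: no optimizer state to report
    return [group["lr"] for group in optimizer_state["param_groups"]]


def report_checkpoint(path):
    """Load a checkpoint from disk and print what it contains."""
    import torch  # imported lazily so optimizer_lrs works without PyTorch

    checkpoint = torch.load(path, map_location="cpu")
    print("keys:", sorted(checkpoint.keys()))
    print("epoch:", checkpoint.get("meta", {}).get("epoch"))
    print("learning rates:", optimizer_lrs(checkpoint))
```

Calling `report_checkpoint("snapshot.pth")` before resuming would show which LR the run will continue from. A checkpoint without an "optimizer" entry can only be used to initialize weights, which corresponds to the load-weights path described above.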

openvinotoolkit / training_extensions

Finetuning a trained model #601