openvinotoolkit / training_extensions

Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
https://openvinotoolkit.github.io/training_extensions/
Apache License 2.0

Unable to train Mobilenet-ssd from NNCF Object Detection Examples #249

Closed muhammad-maaz-confiz closed 3 years ago

muhammad-maaz-confiz commented 4 years ago

Dear Team,

I am trying to train Mobilenet-SSD using the config openvino_training_extensions/pytorch_toolkit/nncf/examples/object_detection/configs/ssd_mobilenet_voc.json, and I get the following error:

raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])

The training command, run from the examples/object_detection directory, is:

python main.py -m train --config configs/ssd_mobilenet_voc.json --data data/VOCdevkit --log-dir=results/quantization/

Environment Details:

python=3.7.6
torch=1.2
torchvision=0.4.0

Also, on a small GPU such as an RTX 2080 Super with 8 GB of memory, training will probably throw an out-of-memory error (it did when I tried to train ssd300_int8). Is there any way to train on a GPU with 8 GB of memory? Furthermore, are any pre-trained quantized models available for testing speed on particular hardware? Thanks

Stay Safe, Maaz

AlexanderDokuchaev commented 4 years ago

@AlexKoff88 @vshampor @ljaljushkin

AlexKoff88 commented 4 years ago

@vshampor, can you please provide the correct command line that we use in our validation?

As for the memory problem, I suggest using a smaller batch size, although the accuracy metric (mAP) may also be lower in that case.

vshampor commented 4 years ago

Greetings, @muhammad-maaz-confiz !

In the examples/object_detection/configs/ssd_mobilenet_voc.json, replace:

"input_info": {
    "sample_size": [1, 3, 300, 300]
},

with

"input_info": {
    "sample_size": [2, 3, 300, 300]
},

This shape specification is used for the initial dry run of the model's forward function, which we need in order to build the internal graph representation of the PyTorch model. Since the model has batch-norm layers, even this dry run must use input tensors with a batch dimension larger than 1. We will fix the config in the next release.
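To illustrate why a batch dimension of 1 fails here, the following is a minimal sketch in plain Python. It only mimics the kind of size check that raises the error quoted in the original report; it does not call PyTorch, and the function names `values_per_channel` and `check_batchnorm_training_input` are hypothetical helpers introduced for this example.

```python
from functools import reduce
from operator import mul


def values_per_channel(shape):
    """Count the values batch norm would average over for each channel.

    For an (N, C, *spatial) shape this is N times the product of the
    spatial dimensions.
    """
    batch, _channels, *spatial = shape
    return batch * reduce(mul, spatial, 1)


def check_batchnorm_training_input(shape):
    """Mimic (not call) a training-mode batch-norm input size check."""
    if values_per_channel(shape) <= 1:
        raise ValueError(
            "Expected more than 1 value per channel when training, "
            "got input size {}".format(list(shape))
        )


# Shape from the reported traceback: a single value per channel.
try:
    check_batchnorm_training_input((1, 128, 1, 1))
except ValueError as err:
    print("batch 1:", err)

# With a batch dimension of 2 there are two values per channel.
check_batchnorm_training_input((2, 128, 1, 1))
print("batch 2: ok")
```

With a 1x1 feature map, the batch dimension is the only source of more than one value per channel, which is why raising it to 2 is sufficient for the dry run.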

You can train the model with a smaller batch size by passing the -b <batch_size> option on the example command line, or by specifying a "batch_size": <batch_size> option in the config file. This should let you train the model under GPU memory constraints.
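The two config changes described above can also be applied programmatically. The sketch below is an illustration only: `patch_config` is a hypothetical helper, the batch size of 8 is an arbitrary example value, and placing "batch_size" at the top level of the config is an assumption, since the thread does not show where that key belongs.

```python
import json


def patch_config(config, dry_run_batch=2, batch_size=None):
    """Return a patched copy of a config dict (hypothetical helper).

    - Raises the batch dimension of input_info.sample_size to at least
      dry_run_batch, so the dry run sees more than one sample.
    - Optionally sets a training batch size (top-level key assumed).
    """
    patched = json.loads(json.dumps(config))  # deep copy via JSON round trip
    sample_size = patched["input_info"]["sample_size"]
    sample_size[0] = max(sample_size[0], dry_run_batch)
    if batch_size is not None:
        patched["batch_size"] = batch_size
    return patched


# Fragment of the config as quoted in this thread; other keys omitted.
original = {"input_info": {"sample_size": [1, 3, 300, 300]}}

# 8 is an illustrative value, not a recommendation from the maintainers.
patched = patch_config(original, batch_size=8)
print(json.dumps(patched, indent=2))
```

To apply this to the real file, one would load it with `json.load`, pass the result through the helper, and write it back with `json.dump`; the original dict is left unmodified.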

We currently provide pre-trained quantized models only for SSD-VGG, and these are the only models that we validate, even though the configs for SSD-MobileNet are also part of the repository. @AlexKoff88, is this something that we should fix in the next release?

AlexKoff88 commented 4 years ago

@vshampor, it is at least worth validating that this config works, even without obtaining the final mAP numbers.

muhammad-maaz-confiz commented 4 years ago

Thank you, the model is now training.

mmaaz60 commented 4 years ago

Hi @vshampor,

I just wanted to confirm that ssd_mobilenet_voc.json uses MobileNetV2 as the backbone with an SSD head. Thanks

Maaz