google / aiyprojects-raspbian

API libraries, samples, and system images for AIY Projects (Voice Kit and Vision Kit)
https://aiyprojects.withgoogle.com/
Apache License 2.0

MobileNet_V2 with SSDLite on AIY Vision #402

Closed · maxritter closed this 6 years ago

maxritter commented 6 years ago

I am wondering if it is possible to use the new MobileNet_V2 with SSDLite object detection network on the vision bonnet?

If so, are there any limitations on the input image size or the depth multiplier?

My vision kit arrives next week here in Germany, so I am looking forward to getting an answer to this :)

weiran-work commented 6 years ago

@maxritter This is non-trivial to do. There's a good chance it will work if you do it right, but we don't have any model that uses such a configuration yet.

Existing detector model: Our person/cat/dog detector is based on MobileNetV1 + SSD, 256x256 with depthwise multiplier 0.125. chadwallacehart@ wrote a guide on retraining a customized detector based on this config; see issue #314 for more details (it's a long thread, focus on the last few comments). It should be pretty straightforward if you follow this config.
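For orientation, the relevant fields of such a pipeline config look roughly like the sketch below. This is abridged and illustrative, not the exact config we ship; field names follow the TF Object Detection API's pipeline.proto, and num_classes here is just an example:

```
model {
  ssd {
    num_classes: 3  # e.g. person, cat, dog
    image_resizer {
      fixed_shape_resizer {
        height: 256
        width: 256
      }
    }
    feature_extractor {
      type: 'embedded_ssd_mobilenet_v1'
      depth_multiplier: 0.125
      min_depth: 16
    }
  }
}
```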

To achieve what you asked:

MobileNet V2: We released models trained on the iNaturalist data set (https://aiyprojects.withgoogle.com/model/nature-explorer/); those are based on MobileNet V2 with configuration 192x192, depthwise multiplier 1.0. I'd expect 256x256 with depthwise multiplier 0.125 to fit within the resource budget.

Training script: To retrain MobileNetV1 + SSD, you'd need to stick with this script, which is written specifically for the Vision Kit. The main difference is the conv kernel sizes for the SSD head: for a regular MobileNet + SSD, there are 6 layers for 'ssd_anchor_generator', which causes some conv ops to have an input size (height x width) smaller than the kernel size. TF allows such operations, but the Vision Kit cannot run them (see the config sketch below).
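To illustrate, here's roughly what the anchor generator looks like in the embedded sample config (a sketch from memory, so double-check against the actual file): num_layers drops from the usual 6 to 5 so that no conv input becomes smaller than its kernel:

```
anchor_generator {
  ssd_anchor_generator {
    num_layers: 5   # regular MobileNet+SSD configs use 6
    min_scale: 0.2
    max_scale: 0.95
  }
}
```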

SSDLite: The main difference versus SSD is the use of depthwise conv instead of regular conv; I don't think this will cause trouble. But again, the input size can NOT be smaller than the kernel size for a depthwise convolution (see the sketch below).
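In the TF Object Detection API config, SSDLite is typically expressed by enabling use_depthwise in both the feature extractor and the box predictor. A sketch (the reduced depth multiplier is my suggestion for the bonnet, not SSDLite's default):

```
feature_extractor {
  type: 'ssd_mobilenet_v2'
  depth_multiplier: 0.125  # reduced from the usual 1.0 to fit the bonnet
  use_depthwise: true
}
box_predictor {
  convolutional_box_predictor {
    kernel_size: 3
    use_depthwise: true
  }
}
```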

You can follow this example (MobileNet + SSD feature extractor) to figure out how to configure your feature extractor for MobileNetV2 + SSDLite: https://github.com/tensorflow/models/blob/master/research/object_detection/models/embedded_ssd_mobilenet_v1_feature_extractor.py

Note that there's no pretrained checkpoint, so you'd have to start training from scratch.

We will release a script to help check whether a compiled model can run on the Vision Kit or not. Stay tuned.
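In the meantime, here's a rough sketch of what such a check could look like. This is a hypothetical helper, not the tool we'll ship; it assumes a TF 1.x frozen graph in NHWC format and only catches the input-smaller-than-kernel issue described above:

```python
# Rough sketch (hypothetical helper, not the official AIY checker): walk a
# frozen TF 1.x graph and flag convolutions whose spatial input is smaller
# than their kernel -- TensorFlow accepts these, the Vision Bonnet does not.
import tensorflow as tf

def find_undersized_convs(frozen_graph_path):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(frozen_graph_path, 'rb') as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name='')

    offenders = []
    for op in graph.get_operations():
        if op.type not in ('Conv2D', 'DepthwiseConv2dNative'):
            continue
        # Assumes NHWC data format; filter shape is [kH, kW, inC, outC].
        input_shape = op.inputs[0].shape
        filter_shape = op.inputs[1].shape
        if input_shape.ndims != 4 or filter_shape.ndims != 4:
            continue  # shapes not statically known, skip
        in_h, in_w = input_shape[1].value, input_shape[2].value
        k_h, k_w = filter_shape[0].value, filter_shape[1].value
        if None in (in_h, in_w, k_h, k_w):
            continue
        if in_h < k_h or in_w < k_w:
            offenders.append((op.name, (in_h, in_w), (k_h, k_w)))
    return offenders

if __name__ == '__main__':
    for name, (in_h, in_w), (k_h, k_w) in find_undersized_convs(
            'frozen_inference_graph.pb'):
        print('%s: input %dx%d is smaller than kernel %dx%d'
              % (name, in_h, in_w, k_h, k_w))
```

('frozen_inference_graph.pb' is the default name the object detection exporter uses; point it at your own export. A clean run doesn't guarantee the model fits the bonnet's resource budget, it only rules out this one class of unsupported op.)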

Let me know if you run into issues.

maxritter commented 6 years ago

Thanks for your detailed answer, I highly appreciate it!

I retrained my model on the described embedded MobileNetV1 + SSD version and will check tomorrow, when my board arrives, whether it works there. The results on my desktop PC seem good enough, so I don't need to switch to MobileNet_V2 + SSDLite anytime soon.

Looking forward to all the new updates for the vision kit :)