opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

Enable Eltwise layer with different numbers of input channels #15739

Closed. dkurt closed this pull request 4 years ago.

dkurt commented 4 years ago

This pull request changes

resolves https://github.com/opencv/opencv/issues/15724

This PR enables the Eltwise layer (sum, prod, max) for input tensors with different numbers of channels.

Merge with extra: https://github.com/opencv/opencv_extra/pull/679
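For reference, here is a minimal sketch of how the feature can be exercised through the C++ DNN API (file names and parameters are placeholders mirroring the YOLOv3-tiny-PRN settings used in the discussion below; this is not code from the PR itself):

#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Any Darknet model whose shortcut layers mix different channel counts
    // (e.g. YOLOv3-tiny-PRN) exercises the new Eltwise path.
    cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov3-tiny-prn.cfg",
                                                   "yolov3-tiny-prn.weights");
    cv::Mat img = cv::imread("traffic_jam.jpg");
    // 1/255 scale, 416x416 input and RGB order match the Darknet settings used below.
    cv::Mat blob = cv::dnn::blobFromImage(img, 1.0 / 255.0, cv::Size(416, 416),
                                          cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
    net.setInput(blob);
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());
    std::cout << "Got " << outs.size() << " output tensors" << std::endl;
    return 0;
}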

Median (ms)

                       Name of Test                          3.4      pr   pr vs 3.4 (x-factor)
AlexNet::DNNTestNetwork::OCV/CPU                           14.343  14.430     0.99   
DenseNet_121::DNNTestNetwork::OCV/CPU                      37.195  37.235     1.00   
EAST_text_detection::DNNTestNetwork::OCV/CPU               67.625  67.560     1.00   
ENet::DNNTestNetwork::OCV/CPU                              44.997  45.185     1.00   
FastNeuralStyle_eccv16::DNNTestNetwork::OCV/CPU            115.694 116.573    0.99   
GoogLeNet::DNNTestNetwork::OCV/CPU                         15.383  15.265     1.01   
Inception_5h::DNNTestNetwork::OCV/CPU                      16.909  16.771     1.01   
Inception_v2_Faster_RCNN::DNNTestNetwork::OCV/CPU          286.174 287.129    1.00   
Inception_v2_SSD_TensorFlow::DNNTestNetwork::OCV/CPU       42.374  42.402     1.00   
MobileNet_SSD_Caffe::DNNTestNetwork::OCV/CPU               19.031  19.038     1.00   
MobileNet_SSD_v1_TensorFlow::DNNTestNetwork::OCV/CPU       20.979  20.971     1.00   
MobileNet_SSD_v2_TensorFlow::DNNTestNetwork::OCV/CPU       29.889  29.921     1.00   
OpenFace::DNNTestNetwork::OCV/CPU                           3.887   3.916     0.99   
OpenPose_pose_mpi_faster_4_stages::DNNTestNetwork::OCV/CPU 602.822 605.825    1.00   
ResNet_50::DNNTestNetwork::OCV/CPU                         35.661  35.846     0.99   
SSD::DNNTestNetwork::OCV/CPU                               268.011 267.123    1.00   
SqueezeNet_v1_1::DNNTestNetwork::OCV/CPU                    3.860   3.913     0.99   
YOLOv3::DNNTestNetwork::OCV/CPU                            214.246 213.658    1.00   
opencv_face_detector::DNNTestNetwork::OCV/CPU              13.413  13.498     0.99   
force_builders=Custom,Custom Win,Custom Mac
build_image:Custom=ubuntu-openvino-2019r3.0:16.04
build_image:Custom Win=openvino-2019r3.0
build_image:Custom Mac=openvino-2019r3.0

test_modules:Custom=dnn,python2,python3,java
test_modules:Custom Win=dnn,python2,python3,java
test_modules:Custom Mac=dnn,python2,python3,java

buildworker:Custom=linux-1
# disabled due to high memory usage: test_opencl:Custom=ON
test_opencl:Custom=OFF
test_bigdata:Custom=1
test_filter:Custom=*
allow_multiple_commits=1
Arcitec commented 4 years ago

EDIT: HEY, THIS HAS BEEN MERGED! For anyone who needs this new feature right now on Python and can't wait 3 months for the next OpenCV version, read the following post:

Build guide: https://github.com/opencv/opencv/pull/15739#issuecomment-546071628

Also check https://github.com/opencv/opencv/pull/15739#issuecomment-544931445 if you wanna see the amazing benchmarks of this new network!


Wow, I've done a small code review and everything looks excellent. Nice refactoring of activation handling, and the alpha parameter is nicely handled via blending coefficients! Great job!

Although I don't understand about half of the eltwise code. Is it doing a resize so the layers match each other's size (that's what darknet does via the copy_cpu() line when sizes mismatch), or is it doing a lookup as in "read pixel 1 from source pixel 1, read pixel 2 from source pixel 1", etc.? I see the loops and the sorting by channels, and it seems to be doing the latter. But like I said, I don't fully understand it. Perhaps it is doing an interpolated resize somewhere!

I am also unsure where the coeffs are doing the alpha blending (multiplication). Unless this is the relevant line: https://github.com/opencv/opencv/pull/15739/commits/adbd6136604c5b9fed570f8a294f83dbfd5aa2e7#diff-7ac73ff12c29882cb913b6f09da2f82cR258

As for testing (compiling), I'll try now! But I haven't compiled OpenCV before. I use it via Python.

Thank you so much for everything @dkurt!

Arcitec commented 4 years ago

Goddamn that was hard to compile. Was following https://docs.opencv.org/3.4/d3/d52/tutorial_windows_install.html which is severely outdated and required researching many changes. I split that into two files.

installocv1.sh:

#!/bin/bash -e
myRepo=$(pwd)
if [ ! -d "$myRepo/opencv" ]; then
    echo "cloning opencv"
    git clone https://github.com/opencv/opencv.git
    # -p: create parent dirs and don't fail if they already exist (the script runs with -e)
    mkdir -p Build/opencv
    mkdir -p Install/opencv
else
    cd opencv
    git pull --rebase
    cd ..
fi
if [ ! -d "$myRepo/opencv_contrib" ]; then
    echo "cloning opencv_contrib"
    git clone https://github.com/opencv/opencv_contrib.git
    mkdir -p Build/opencv_contrib
else
    cd opencv_contrib
    git pull --rebase
    cd ..
fi

Then I entered the opencv folder, ran git checkout 3.4, and applied the https://github.com/opencv/opencv/pull/15739.patch patch.

Next, I ran my modified second file:

installocv2.sh:

#!/bin/bash -e
myRepo=$(pwd)
CMAKE_CONFIG_GENERATOR="Visual Studio 16 2019"
CMAKE_CONFIG_ARCH="x64"
RepoSource=opencv
pushd Build/$RepoSource
CMAKE_OPTIONS='-DBUILD_PERF_TESTS:BOOL=OFF -DBUILD_TESTS:BOOL=OFF -DBUILD_DOCS:BOOL=OFF  -DWITH_CUDA:BOOL=OFF -DBUILD_EXAMPLES:BOOL=OFF -DINSTALL_CREATE_DISTRIB=ON'
cmake -G"$CMAKE_CONFIG_GENERATOR" -A"$CMAKE_CONFIG_ARCH" $CMAKE_OPTIONS -DOPENCV_EXTRA_MODULES_PATH="$myRepo"/opencv_contrib/modules -DCMAKE_INSTALL_PREFIX="$myRepo"/install/"$RepoSource" "$myRepo/$RepoSource"
echo "************************* $Source_DIR -->debug"
cmake --build .  --config debug
echo "************************* $Source_DIR -->release"
cmake --build .  --config release
cmake --build .  --target install --config release
cmake --build .  --target install --config debug
popd

It's compiling now. Going for a coffee, then I'll try to get the OpenCV C++ interface working and will be trying YOLOv3-Tiny-PRN!
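As a first sanity check once the build finishes, a tiny program like this (assuming the new install prefix is on the compiler's include/library paths) should report the patched 3.4 headers:

#include <opencv2/core/version.hpp>
#include <iostream>

int main()
{
    // CV_VERSION comes from the headers picked up at compile time,
    // so it confirms which OpenCV build the project is actually using.
    std::cout << "OpenCV version: " << CV_VERSION << std::endl;
    return 0;
}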

dkurt commented 4 years ago

You may pass /m:4 to MSBuild (e.g. cmake --build . --config release -- /m:4) for a multithreaded build to speed it up.

The Eltwise layer is an element-wise summation, product, or maximum. In the case of different numbers of channels, it sums only the shared channels:

5 3 2
* * *
* * *
* *
*
*

Here are three inputs with 5, 3, and 2 channels respectively.
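A standalone sketch (not the actual OpenCV kernel) of what "sum only the shared channels" means for two inputs. I assume here that the output takes the larger channel count, as in the shape-inference snippet quoted later in this thread, and that channels missing from the narrower input simply contribute nothing:

#include <algorithm>
#include <vector>

// a and b are CHW tensors with the same spatial size (h x w) but possibly
// different channel counts ca and cb. Shared channels are summed; the rest
// are taken from whichever input actually has them.
std::vector<float> eltwiseSum(const std::vector<float>& a, int ca,
                              const std::vector<float>& b, int cb,
                              int h, int w)
{
    const int plane = h * w;
    const int cOut = std::max(ca, cb);
    std::vector<float> out(cOut * plane, 0.0f);
    for (int c = 0; c < cOut; ++c)
        for (int i = 0; i < plane; ++i)
        {
            float v = 0.0f;
            if (c < ca) v += a[c * plane + i];
            if (c < cb) v += b[c * plane + i];
            out[c * plane + i] = v;
        }
    return out;
}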

Arcitec commented 4 years ago

@dkurt Hey :-) The build finished a few minutes ago and I just figured out how to include it in a new project. I was just about to look up how to load and run DNNs using the C++ interface!

Regarding your eltwise description: I still don't understand. Darknet resizes the input to the same size as the output before summing different-size layers. Is this patch behaving the same way? That's all that matters. :-)

Alright I'm gonna be ready with test results pretty soon...

dkurt commented 4 years ago

I'll mark the PR as work in progress for now. Need to check Darknet's behavior once again. Thanks!

Arcitec commented 4 years ago

@dkurt Okay, so after getting stuck using the C++ API directly (I loaded the net and ran a forward pass, but realized it's a lot of work to actually draw the outputs), I instead compiled the https://github.com/opencv/opencv/blob/master/samples/dnn/object_detection.cpp example detector, which I didn't even realize existed. It would have saved me half an hour if I had known about that! :-D

Then I ran with --async=0 --backend=3 --config="net\yolov3-tiny-prn.cfg" --model="net\yolov3-tiny-prn.weights" --classes="net\coco.names" --height=416 --width=416 --mean="0 0 0" --rgb --scale=0.003921568627451 --target=0 --thr=0.3 --input="traffic_jam.jpg" (The scale is the value of 1 / 255, which is what darknet expects.)

And I see that it does detect objects and adds some bounding boxes. So that's a good sign. The net/weights are the official files directly from the YOLOv3-Tiny-PRN researchers. And coco.names is standard. :-)

https://github.com/AlexeyAB/darknet/blob/master/cfg/coco.names https://github.com/WongKinYiu/PartialResidualNetworks/blob/master/cfg/yolov3-tiny-prn.cfg https://github.com/WongKinYiu/PartialResidualNetworks/blob/master/model/yolov3-tiny-prn.weights

Regarding the darknet behavior, I linked here https://github.com/opencv/opencv/pull/15739#issuecomment-543835051 to the source file where you see "if input and output of shortcut layer are different size, do copy_cpu". So that function will tell you what darknet does. I am guessing linear scaling (interpolation).

Anyway I am available for more testing, now that my C++ build environment and test code are all complete!

Edit: Turns out the PRN network is not behaving properly. See answers below.

alalek commented 4 years ago

-DBUILD_EXAMPLES:BOOL=OFF

BTW, it is better to turn this option ON and build the required sample project directly (it will appear in the OpenCV.sln project list).

Arcitec commented 4 years ago

Ah okay, yeah I saw that flag and thought I should probably have set that. It would have saved some time getting me started! ;-)

Okay, I am seeing some misbehavior in OpenCV.

Folder img contains:

croppedcars.jpg


traffic_jam.jpg


Folder net contains:

coco.darknet.data

classes= 80
names = net/coco.names

coco.names (from URL in previous post)

yolov3-tiny-prn.cfg (from URL in previous post)

yolov3-tiny-prn.weights (from URL in previous post)

Script darknet.cmd contains:

@echo off
C:\darknet\build\darknet\x64\darknet.exe detector test -thresh 0.1 "net\coco.darknet.data" "net\yolov3-tiny-prn.cfg" "net\yolov3-tiny-prn.weights" "%1"

Script opencv.cmd contains:

@echo off
TestOpenCV.exe --async=0 --backend=3 --config="net\yolov3-tiny-prn.cfg" --model="net\yolov3-tiny-prn.weights" --classes="net\coco.names" --height=416 --width=416 --mean="0 0 0" --rgb --scale=0.003921568627451 --target=0 --thr=0.1 --input="%1"

(TestOpenCV is just the name of my binary of your official DNN example file.)

Output from running darknet.cmd "img\croppedcars.jpg":

img\croppedcars.jpg: Predicted in 9.006000 milli-seconds.
car: 97%
car: 16%
truck: 10%
car: 89%
truck: 30%
car: 84%
car: 88%
car: 11%
car: 53%
car: 12%
car: 87%
truck: 14%
car: 66%
truck: 19%
car: 34%
truck: 10%
car: 19%
car: 29%
car: 15%
car: 43%
car: 28%
car: 75%
car: 30%
truck: 54%
car: 23%
suitcase: 19%
car: 35%
truck: 12%
car: 30%
truck: 21%
car: 17%
truck: 10%
car: 52%
car: 89%
car: 41%
car: 13%
car: 75%
car: 14%
car: 51%
car: 17%
car: 13%

[image: darknet10prn]

Output from running opencv.cmd "img\croppedcars.jpg":

[image: opencv]

Both of those configs use a 10% confidence threshold.

Here's Darknet with its default threshold (not sure what its default is, but seems to be 20%+). Those were the darknet results before I edited darknet.cmd to force 10% threshold:

[image: darknet]

Arcitec commented 4 years ago

I thought I may have made a mistake with the threshold, but nope, I didn't...

I've updated the post above to clarify that Darknet uses threshold 10% too.

Both tests above use 10% threshold.

Arcitec commented 4 years ago

More tests:

OpenCV with 20% threshold = Looks the same as the 10% image above (same result, a whole image filled with "person"-detections).

OpenCV with 30% threshold = Far fewer detections than darknet, clearly not behaving properly (few detections, some wrong labels, weird confidences):

[image: opencv30]

Any ideas why the net is misbehaving? Possibly due to the shortcut layer resize technique being different?

Arcitec commented 4 years ago

Here is an OpenCV test comparison with regular YOLOv3-Tiny (not the shortcut-based PRN version), just to see that OpenCV properly handles the regular net!

net\yolov3-tiny.cfg

From https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-tiny.cfg BUT YOU MUST EDIT THE mask = 0,1,2 LINE TO SAY mask = 1,2,3. This is because the official pjreddie weights were trained with the buggy 1,2,3 mask, and he (pjreddie) later fixed the config (changed it to 0,1,2) without re-training the weights, which leads to bad bounding boxes if we don't restore the buggy mask (we talked a lot about it here: https://github.com/WongKinYiu/PartialResidualNetworks/issues/2).

net\yolov3-tiny.weights

From https://pjreddie.com/media/files/yolov3-tiny.weights (as described at https://github.com/AlexeyAB/darknet#pre-trained-models regarding "yolov3-tiny.cfg")

opencv-tiny.cmd

@echo off
TestOpenCV.exe --async=0 --backend=3 --config="net\yolov3-tiny.cfg" --model="net\yolov3-tiny.weights" --classes="net\coco.names" --height=416 --width=416 --mean="0 0 0" --rgb --scale=0.003921568627451 --target=0 --thr=0.1 --input="%1"

darknet-tiny.cmd

@echo off
C:\darknet\build\darknet\x64\darknet.exe detector test -thresh 0.1 "net\coco.darknet.data" "net\yolov3-tiny.cfg" "net\yolov3-tiny.weights" "%1"

OpenCV results at 10% threshold:

[image: lite10opencv]

And here's the YOLOv3-Tiny model result at 10% threshold in darknet instead:

[image: darknet10lite]

So the regular YOLOv3-Tiny behaves properly in OpenCV (the small differences seem to come from NMS differences between OpenCV and darknet).

Only the YOLOv3-Tiny-PRN model fails. This probably means the shortcut layer is not correct in OpenCV (the most likely answer, since the PRN version only changes filter counts and adds shortcut layers).

WongKinYiu commented 4 years ago

I am not sure whether I understood the code correctly, but channel_of_output seems to be the maximum of channel_of_input and channel_of_from. (If I misunderstood, please ignore this comment.)

        int dims = inputs[0].size();
        int numChannels = inputs[0][1];
        for (int i = 1; i < inputs.size(); i++)
        {
            CV_Assert(inputs[0][0] == inputs[i][0]);
            numChannels = std::max(numChannels, inputs[i][1]);

            // It's allowed for channels axis to be different.
            for (int j = 2; j < dims; j++)
                CV_Assert(inputs[0][j] == inputs[i][j]);
        }

        outputs.assign(1, inputs[0]);
        outputs[0][1] = numChannels;

However, in the implementation of darknet, channel_of_output will be channel_of_input, as shown in the following figure.

[image]

1st step: copy input to output:

        copy_cpu(l.outputs*l.batch, state.input, 1, l.output, 1);

2nd step: add from to output:

        shortcut_cpu(l.batch, l.w, l.h, l.c, state.net.layers[l.index].output, l.out_w, l.out_h, l.out_c, l.output);

What shortcut_cpu does:

void shortcut_cpu(int batch, int w1, int h1, int c1, float *add, int w2, int h2, int c2, float *out)
{
    int stride = w1/w2;
    int sample = w2/w1;
    assert(stride == h1/h2);
    assert(sample == h2/h1);
    if(stride < 1) stride = 1;
    if(sample < 1) sample = 1;
    int minw = (w1 < w2) ? w1 : w2;
    int minh = (h1 < h2) ? h1 : h2;
    int minc = (c1 < c2) ? c1 : c2;

    int i,j,k,b;
    for(b = 0; b < batch; ++b){
        for(k = 0; k < minc; ++k){
            for(j = 0; j < minh; ++j){
                for(i = 0; i < minw; ++i){
                    int out_index = i*sample + w2*(j*sample + h2*(k + c2*b));
                    int add_index = i*stride + w1*(j*stride + h1*(k + c1*b));
                    out[out_index] += add[add_index];
                }
            }
        }
    }
}
Arcitec commented 4 years ago

@WongKinYiu Thank you so much for helping with the research! <3

Good idea to bring all the code in here for easy overview!

Here's the parse_shortcut config parser (ancient) from pjreddie:

https://github.com/pjreddie/darknet/blob/b13f67bfdd87434e141af532cdb5dc1b8369aa3b/src/parser.c#L539-L556

layer parse_shortcut(list *options, size_params params, network *net)
{
    char *l = option_find(options, "from");
    int index = atoi(l);
    if(index < 0) index = params.index + index;

    int batch = params.batch;
    layer from = net->layers[index];

    layer s = make_shortcut_layer(batch, index, params.w, params.h, params.c, from.out_w, from.out_h, from.out_c);

    char *activation_s = option_find_str(options, "activation", "linear");
    ACTIVATION activation = get_activation(activation_s);
    s.activation = activation;
    s.alpha = option_find_float_quiet(options, "alpha", 1);
    s.beta = option_find_float_quiet(options, "beta", 1);
    return s;
}

Here's the parse_shortcut config parser (modern) from AlexeyAB:

https://github.com/AlexeyAB/darknet/blob/1c71f001531a5df0637903117c6568725d7a66b3/src/parser.c#L602-L619

layer parse_shortcut(list *options, size_params params, network net)
{
    int assisted_excitation = option_find_float_quiet(options, "assisted_excitation", 0);
    char *l = option_find(options, "from");
    int index = atoi(l);
    if(index < 0) index = params.index + index;

    int batch = params.batch;
    layer from = net.layers[index];
    if (from.antialiasing) from = *from.input_layer;

    layer s = make_shortcut_layer(batch, index, params.w, params.h, params.c, from.out_w, from.out_h, from.out_c, assisted_excitation);

    char *activation_s = option_find_str(options, "activation", "linear");
    ACTIVATION activation = get_activation(activation_s);
    s.activation = activation;
    return s;
}

(As you can see above, @AlexeyAB has removed the alpha and beta parameters, which may mean that he has moved them somewhere else... but I am not sure... maybe he really did remove them. The only parser.c reference I can find to "alpha" now is in parse_normalization which makes a normalization layer at https://github.com/AlexeyAB/darknet/blob/1c71f001531a5df0637903117c6568725d7a66b3/src/parser.c#L586-L594)

Here's the make_shortcut_layer function which sets up the shortcut layer when building the graph:

https://github.com/AlexeyAB/darknet/blob/f6fa4a56d938f4f8c69774d3622e768e7411507d/src/shortcut_layer.c#L8-L49

layer make_shortcut_layer(int batch, int index, int w, int h, int c, int w2, int h2, int c2, int assisted_excitation)
{
    if(assisted_excitation) fprintf(stderr, "Shortcut Layer - AE: %d\n", index);
    else fprintf(stderr,"Shortcut Layer: %d\n", index);
    layer l = { (LAYER_TYPE)0 };
    l.type = SHORTCUT;
    l.batch = batch;
    l.w = w2;
    l.h = h2;
    l.c = c2;
    l.out_w = w;
    l.out_h = h;
    l.out_c = c;
    l.outputs = w*h*c;
    l.inputs = l.outputs;

    l.assisted_excitation = assisted_excitation;

    if(w != w2 || h != h2 || c != c2) fprintf(stderr, " w = %d, w2 = %d, h = %d, h2 = %d, c = %d, c2 = %d \n", w, w2, h, h2, c, c2);

    l.index = index;

    l.delta = (float*)calloc(l.outputs * batch, sizeof(float));
    l.output = (float*)calloc(l.outputs * batch, sizeof(float));

    l.forward = forward_shortcut_layer;
    l.backward = backward_shortcut_layer;
#ifdef GPU
    l.forward_gpu = forward_shortcut_layer_gpu;
    l.backward_gpu = backward_shortcut_layer_gpu;

    l.delta_gpu =  cuda_make_array(l.delta, l.outputs*batch);
    l.output_gpu = cuda_make_array(l.output, l.outputs*batch);
    if (l.assisted_excitation)
    {
        const int size = l.out_w * l.out_h * l.batch;
        l.gt_gpu = cuda_make_array(NULL, size);
        l.a_avg_gpu = cuda_make_array(NULL, size);
    }
#endif  // GPU
    return l;
}

Here's forward_shortcut_layer and backward_shortcut_layer:

https://github.com/AlexeyAB/darknet/blob/f6fa4a56d938f4f8c69774d3622e768e7411507d/src/shortcut_layer.c#L71-L94

void forward_shortcut_layer(const layer l, network_state state)
{
    if (l.w == l.out_w && l.h == l.out_h && l.c == l.out_c) {
        int size = l.batch * l.w * l.h * l.c;
        int i;
        #pragma omp parallel for
        for(i = 0; i < size; ++i)
            l.output[i] = state.input[i] + state.net.layers[l.index].output[i];
    }
    else {
        copy_cpu(l.outputs*l.batch, state.input, 1, l.output, 1);
        shortcut_cpu(l.batch, l.w, l.h, l.c, state.net.layers[l.index].output, l.out_w, l.out_h, l.out_c, l.output);
    }
    activate_array(l.output, l.outputs*l.batch, l.activation);

    if (l.assisted_excitation && state.train) assisted_excitation_forward(l, state);
}

void backward_shortcut_layer(const layer l, network_state state)
{
    gradient_array(l.output, l.outputs*l.batch, l.activation, l.delta);
    axpy_cpu(l.outputs*l.batch, 1, l.delta, 1, state.delta, 1);
    shortcut_cpu(l.batch, l.out_w, l.out_h, l.out_c, l.delta, l.w, l.h, l.c, state.net.layers[l.index].delta);
}

Here's copy_cpu:

https://github.com/AlexeyAB/darknet/blob/eac26226a7fc0a9da2b684a564f8f086eaf38390/src/blas.c#L219-L223

void copy_cpu(int N, float *X, int INCX, float *Y, int INCY)
{
    int i;
    for(i = 0; i < N; ++i) Y[i*INCY] = X[i*INCX];
}

(X is the input layer's data array, Y is the output layer's data array, INCX and INCY are the copy strides, i.e. how many items to skip between elements (both are set to 1 in this call, meaning "don't skip anything"), and N is how many array entries to copy to the output.)

Here's shortcut_cpu:

https://github.com/AlexeyAB/darknet/blob/eac26226a7fc0a9da2b684a564f8f086eaf38390/src/blas.c#L71-L95

void shortcut_cpu(int batch, int w1, int h1, int c1, float *add, int w2, int h2, int c2, float *out)
{
    int stride = w1/w2;
    int sample = w2/w1;
    assert(stride == h1/h2);
    assert(sample == h2/h1);
    if(stride < 1) stride = 1;
    if(sample < 1) sample = 1;
    int minw = (w1 < w2) ? w1 : w2;
    int minh = (h1 < h2) ? h1 : h2;
    int minc = (c1 < c2) ? c1 : c2;

    int i,j,k,b;
    for(b = 0; b < batch; ++b){
        for(k = 0; k < minc; ++k){
            for(j = 0; j < minh; ++j){
                for(i = 0; i < minw; ++i){
                    int out_index = i*sample + w2*(j*sample + h2*(k + c2*b));
                    int add_index = i*stride + w1*(j*stride + h1*(k + c1*b));
                    out[out_index] += add[add_index];
                }
            }
        }
    }
}

What makes it hard to decipher the code is that darknet is pretty poorly written, in my opinion. The inconsistent variable names and general structure are very messy, and the lack of comments is extreme. I mean, a quick // Does a nearest-neighbor copy from input to output whenever the layers are of different size. would have made Darknet a lot more maintainable. Hehe.

I'm reading the shortcut_cpu function above, and it seems to calculate a "which pixel to sample" value, but it does those calculations as ints. My guess is that whenever it "shortcuts" between different-size layers, it just does an extremely jagged "nearest neighbor" scaling (no smooth interpolation at all).

Edit: Yeah looks like a nearest neighbor algorithm, if you compare the code above to this: http://tech-algorithm.com/articles/nearest-neighbor-image-scaling/
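To make the indexing concrete, here is a tiny illustrative program (not darknet code) showing what shortcut_cpu's integer stride/sample arithmetic does for one row with w1 = 8 and w2 = 4: every output pixel reads exactly one source pixel, with no interpolation:

#include <cstdio>

int main()
{
    const int w1 = 8, w2 = 4;      // "from" layer row width vs. output row width
    int stride = w1 / w2;          // 2: step through the wider source
    int sample = w2 / w1;          // 0, clamped to 1 below, exactly as in darknet
    if (stride < 1) stride = 1;
    if (sample < 1) sample = 1;
    const int minw = (w1 < w2) ? w1 : w2;
    for (int i = 0; i < minw; ++i)
        std::printf("out[%d] += add[%d]\n", i * sample, i * stride);
    // Prints: out[0] += add[0], out[1] += add[2], out[2] += add[4], out[3] += add[6]
    return 0;
}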

WongKinYiu commented 4 years ago

@VideoPlayerCode Hello,

Could you help check the number of channels after the Eltwise layer?

Yes, it just does an extremely jagged "nearest neighbor" scaling (no smooth interpolation at all).

Arcitec commented 4 years ago

@WongKinYiu Hi again. :-) I was going to sleep now. But which file do you see that in? I just did a search in all opencv and darknet source code and nothing is named num_of_channels.

Edit: Perhaps you mean numChannels in the code you quoted? Anyway I'm going to sleep for today. Goodnight. ^_^

WongKinYiu commented 4 years ago

@VideoPlayerCode good night.

What I mean is the size of the inputs and output of the shortcut layer in OpenCV DNN.

Arcitec commented 4 years ago

@WongKinYiu Ah, yes, I agree that it looks like Eltwise selects the largest channel count among its inputs for its output. @dkurt will know what it does.

Btw where is that graphic from? https://user-images.githubusercontent.com/12152972/67135742-86d5af80-f24f-11e9-8b64-5892caf77532.png (it's not in the v1/v2/v3 YOLO papers).

I'll be back tomorrow to help! Goodnight. :-)

WongKinYiu commented 4 years ago

@VideoPlayerCode Oh, I drew the figure one hour ago.

Arcitec commented 4 years ago

@dkurt Hi, just fyi I am here now and going to test your new change!

I also noticed the coeffs fix. I didn't notice that problem the first time. It stores alpha in coeffs[0], and then uses coeffs[0] to multiply (previously it used coeffs[1]), so I am glad the alpha fix was discovered! Great job!

Okay, time to recompile and test the net again.

Arcitec commented 4 years ago

@dkurt

Test results are here for patch v2! <3 Thank you so much for doing incredible work.

All images are made with a confidence threshold of 10%+. The images below are provided as the original, a scaled side-by-side comparison, and individually scaled versions, to let us compare bounding boxes! The most useful method is to open the individual large images in two browser tabs and switch between them to check box similarity!

NMS is not configurable from darknet's command line as far as I can see, so I guess the slight box differences are due to the lack of (or low amount of) NMS in Darknet.

The ONLY thing I am not sure about is why Darknet sees a bus to the left and OpenCV doesn't.

[image: v2sidebyside]

[image: v2sidebyside-scaled]

[image: v2opencv]

[image: v2darknet]

PS: The "Truck: 0.54" label in OpenCV is consistent with the "truck: 54%" console output from Darknet, so yeah that is a genuine misdetection from the net, and isn't a problem with Darknet/OpenCV.

Arcitec commented 4 years ago

I've figured out why there are some overlapping (extra) boxes in Darknet: They're different color! They're a different class!

Darknet: [image]

OpenCV: [image]

That explains a lot... but not everything! There are also TONS of smaller boxes in the Darknet image that overlap or sit mostly within larger boxes of the exact same object class, i.e. "a 100x100 box of Car (yellow) containing a 30x30 box of Car (yellow)".

My theory on that: OpenCV has better NMS processing and filters out identical-class objects that sit inside larger boxes.
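For reference, the OpenCV sample's post-processing runs the detections through cv::dnn::NMSBoxes (as far as I can tell). A rough standalone sketch of that step, with made-up boxes and thresholds:

#include <opencv2/dnn.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Made-up detections: two heavily overlapping "car" boxes plus an unrelated one.
    std::vector<cv::Rect> boxes = { cv::Rect(100, 100, 100, 100),
                                    cv::Rect(110, 110, 90, 90),
                                    cv::Rect(400, 150, 80, 60) };
    std::vector<float> confidences = { 0.90f, 0.35f, 0.60f };

    const float confThreshold = 0.1f;  // matches the --thr=0.1 used in these tests
    const float nmsThreshold  = 0.4f;  // assumed value, not necessarily the sample's default

    // NMSBoxes drops boxes below confThreshold, then greedily suppresses boxes whose
    // IoU with an already-kept, higher-scoring box exceeds nmsThreshold.
    std::vector<int> kept;
    cv::dnn::NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, kept);

    for (int idx : kept)
        std::cout << "kept box " << idx << " (conf " << confidences[idx] << ")" << std::endl;
    return 0;
}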

So... what can we conclude? @WongKinYiu @dkurt, if I am correct, OpenCV is now calculating exactly the correct neural network output, and the only difference is in box post-processing between Darknet and OpenCV. What do you think?

If I am right about differences in NMS-processing, then the only remaining question is why Darknet sees a car to the left (cropped) and OpenCV doesn't. Even if I set OpenCV to 1% threshold, it doesn't detect a box to the left. This one:

[image: whatisthis]

WongKinYiu commented 4 years ago

@VideoPlayerCode Hello,

I cannot find the reason for the problem in the image since I do not know the implementation details of OpenCV DNN. I have met this kind of problem when translating PyTorch or Caffe models to Darknet; their implementations of the max-pooling stride differ a little bit.

could you help me for filling this table? thanks a lot.

CPU: (your CPU)          DarkNet   OpenCV
yolo-v3-tiny (fps)       30
yolo-v3-tiny-prn (fps)   43
Arcitec commented 4 years ago

@WongKinYiu I was about to head to sleep now but I don't wanna make you wait long for a result. ;-)

I took the OpenCV object detector and modified it at this line for benchmarking instead:

https://github.com/opencv/opencv/blob/master/samples/dnn/object_detection.cpp#L215-L224

Sloppily modified to run the forward pass 400 times, asking the network (OpenCV DNN) each time how long it took, and then calculating the average:

                if (async)
                {
                    futureOutputs.push(net.forwardAsync());
                }
                else
                {
                    double totalTime = 0.0;
                    int benchCount = 400;
                    for (int xyz = 0; xyz <= benchCount; ++xyz) {
                        std::vector<Mat> outs;
                        net.forward(outs, outNames);

                        if (xyz > 0) { // we ignore 0th "warmup" run since slow network setup happens on that run
                            std::vector<double> layersTimings;
                            double freq = cv::getTickFrequency() / 1000;
                            double time = net.getPerfProfile(layersTimings) / freq;
                            totalTime += time;
                            std::cout << "Time: " << time << " ms" << std::endl;

                            if (xyz == benchCount) {
                                std::cout << "Total Time: " << totalTime << " ms for " << benchCount << " runs" << std::endl;
                                std::cout << "Average Time: " << (totalTime / (double)benchCount) << " ms" << std::endl;

                                predictionsQueue.push(outs);
                            }
                        }
                    }
                }

Result YOLOv3-Tiny:

Time: 28.6564 ms
Time: 28.64 ms
Time: 28.6805 ms
Time: 27.8658 ms
Time: 28.2653 ms
Time: 27.4486 ms
Time: 28.1919 ms
Time: 28.0834 ms
Time: 27.6298 ms
Time: 27.9133 ms
Time: 27.3257 ms
Time: 28.398 ms
Time: 27.5317 ms
Time: 28.3219 ms
Time: 29.3646 ms
Time: 29.4685 ms
Time: 29.3866 ms
Time: 28.2108 ms
Time: 28.0706 ms
Time: 28.9971 ms
[..cut..]
Time: 27.3292 ms
Time: 27.5691 ms
Time: 28.031 ms
Time: 27.5387 ms
Time: 26.9302 ms
Time: 27.3555 ms
Time: 28.0565 ms
Time: 27.6875 ms
Time: 28.7753 ms
Time: 27.2354 ms
Time: 27.3248 ms
Time: 28.3666 ms
Time: 27.3292 ms
Time: 27.8084 ms
Time: 27.4206 ms
Time: 27.9692 ms
Time: 28.6945 ms
Time: 27.6334 ms
Time: 28.4752 ms
Time: 27.6757 ms
Time: 28.8734 ms
Total Time: 11551.6 ms for 400 runs
Average Time: 28.879 ms

Result YOLOv3-Tiny-PRN:

Time: 20.3881 ms
Time: 22.5836 ms
Time: 20.5443 ms
Time: 20.308 ms
Time: 20.4397 ms
Time: 21.0414 ms
Time: 20.6547 ms
Time: 20.1291 ms
Time: 21.1341 ms
Time: 20.3625 ms
Time: 21.9148 ms
Time: 21.1358 ms
Time: 20.8719 ms
Time: 20.4887 ms
Time: 22.701 ms
Time: 21.0417 ms
Time: 21.8145 ms
Time: 20.6186 ms
Time: 20.4327 ms
Time: 20.4334 ms
[..cut..]
Time: 21.6263 ms
Time: 22.1348 ms
Time: 21.3634 ms
Time: 21.7802 ms
Time: 19.8965 ms
Time: 20.6174 ms
Time: 20.7663 ms
Time: 20.4019 ms
Time: 20.8847 ms
Time: 21.784 ms
Time: 20.3758 ms
Time: 21.5337 ms
Time: 20.6976 ms
Time: 20.7982 ms
Time: 21.0949 ms
Time: 20.707 ms
Time: 20.8154 ms
Time: 20.537 ms
Time: 20.2921 ms
Time: 21.6338 ms
Time: 20.8722 ms
Total Time: 8239.49 ms for 400 runs
Average Time: 20.5987 ms

Summary of Results:

CPU: Intel Core i7-8750H, laptop CPU (in max performance mode).
Frame: 416x416 RGB (the traffic image I have shown above).

Both nets are trained on COCO. I am using their pretrained weights.
All configs visible earlier in this discussion.

yolov3-tiny:
  Total Time: 11551.6 ms for 400 runs
  Average Time: 28.879 ms (140.20% of runtime of yolov3-tiny-prn)
  Frames Per Second: 34.63
  Takes +8.2803 extra milliseconds per run than yolov3-tiny-prn.

yolov3-tiny-prn:
  Total Time: 8239.49 ms for 400 runs
  Average Time: 20.5987 ms (71.33% of runtime of yolov3-tiny)
  Frames Per Second: 48.55 (40.2% more FPS than yolov3-tiny)

Now let's remember your CPU speed ratio at Darknet: https://github.com/opencv/opencv/issues/15724

Darknet CPU numbers by @WongKinYiu:
125ms (yolov3-tiny, 160.3% of runtime of yolov3-tiny-prn)
78ms (yolov3-tiny-prn, 62.4% of runtime of yolov3-tiny).

So, in conclusion: AWESOME! We got almost the same relative "PRN speedup" in OpenCV (71.33%) as these optimizations gave in Darknet (62.4%)! Darknet's CPU code is terrible, so it doesn't surprise me that it got a slightly larger relative improvement from the PRN network: Darknet is so inefficient at everything that the lowered layer complexity has a bigger effect there, whereas OpenCV is already super efficient and well coded.

Either way, OpenCV got a HUGE improvement too! This is awesome! Thank you so much @WongKinYiu for designing this network and @dkurt for your amazing work implementing the necessary math!

Now you can see why I was so excited about this network! It's giving 40.2% more FPS than YOLOv3-Tiny, and extremely similar detection accuracy. Mindblowing.

And @WongKinYiu if you want me to benchmark via Darknet on this machine, I'd need to know how to do that. Hopefully the answer isn't "use the Darknet library in a C program and time it yourself". If so, do you have any code for that? I don't feel like learning the Darknet C interface. ^_^

Alright world, goodnight for today! :-)

Arcitec commented 4 years ago

By the way all those results are with the full COCO-trained (80 classes) models.

On my existing 1-class YOLOv3-Tiny model, OpenCV takes an average of 24.8562 ms (40.23 FPS) in this benchmark. The 80-class model of YOLOv3-Tiny (which averaged 28.879 ms) is therefore 16.18% slower.

I don't have any YOLOv3-Tiny-PRN 1-class model yet, but if that ratio holds (and I think it will), then we can expect a 1-class PRN model to take about 17.7293 ms (20.5987 ms / 1.1618), i.e. 56.40 FPS.

In other words:

yolov3-tiny, 80 classes (COCO): ~28.879 ms (34.63 FPS)
yolov3-tiny-prn, 80 classes (COCO): ~20.5987 ms (48.55 FPS)
yolov3-tiny, 1 class (my own): ~24.8562 ms (40.23 FPS)
yolov3-tiny-prn, 1 class (my own): ~17.7293 ms (56.40 FPS) <-- GUESS

I will be training a 1-class PRN version, probably tomorrow. To get a real 1-class test for PRN to replace the "GUESS". :-)

Alright, I'm off for today! 😴

dkurt commented 4 years ago

Thank you @VideoPlayerCode and @WongKinYiu! @WongKinYiu, thanks to the scheme from https://github.com/opencv/opencv/pull/15739#issuecomment-544036806 we could achieve the same behavior as Darknet.

dkurt commented 4 years ago

@VideoPlayerCode, @WongKinYiu, you might also be interested in adding this network to our experimental project with accuracy/efficiency diagrams: https://github.com/dkurt/dl_tradeoff. You can see how it looks at https://dkurt.github.io/dl_tradeoff/.

WongKinYiu commented 4 years ago

@dkurt looks great!

Arcitec commented 4 years ago

Results!

@dkurt @WongKinYiu Hello, the results are here. I finally had some time to train and test a 1-class PRN version! This is a followup from https://github.com/opencv/opencv/pull/15739#issuecomment-544217740

I re-ran all tests today, since my CPU is running faster today and the new numbers wouldn't have been comparable to the earlier tests otherwise.

And yes, the theory was correct! 1-class Tiny PRN is super fast just as guessed!

80 Classes (COCO):
yolov3-tiny, 80 classes (COCO): ~26.5894 ms (37.61 FPS), total: 10635.8 ms for 400 runs
yolov3-tiny-prn, 80 classes (COCO): ~19.6459 ms (50.90 FPS), total: 7858.38 ms for 400 runs

1 Class (Custom Training):
yolov3-tiny, 1 class (my own): ~23.7224 ms (42.16 FPS), total: 9488.94 ms for 400 runs
yolov3-tiny-prn, 1 class (my own): ~17.2113 ms (58.10 FPS), total: 6884.53 ms for 400 runs

(I mostly use 1-2 classes, and I don't want to depend on GPUs/CUDA, so the fact that this new network reaches almost 60 FPS on CPU on 1 class is incredible! And so is the fact that it even reaches 51 FPS on 80 classes, so it is suitable for heavy use too!)

Also, during training, I saw that the mAP for YOLOv3-Tiny-PRN is pretty much identical to YOLOv3-Tiny, despite being a much smaller network. The accuracy of the PRN network can also be verified at Wong Kin Yiu's graphs comparing the two nets here on a per-class basis: https://github.com/AlexeyAB/darknet/issues/4091#issuecomment-542513900 ... @WongKinYiu thank you for this genius network design.

In short: YOLOv3-Tiny-PRN gives 35-40% more FPS, with the same accuracy. Thank you both so much for designing the network and for implementing it in OpenCV!

Arcitec commented 4 years ago

PS: About the non-detected bus on the left side of the test image:

I cannot find the reason for the problem in the image since I do not know the implementation details of OpenCV DNN. I have met this kind of problem when translating PyTorch or Caffe models to Darknet; their implementations of the max-pooling stride differ a little bit.

This theory sounds right. Probably subtle differences in some layer implementations. Well, it's okay, the implementation is near perfect and finds all objects with the same accuracies as in Darknet itself! (For example the car that was misdetected as a truck was 54% in both Darknet and OpenCV).

And as for the way Darknet has more boxes than OpenCV (there are many overlapping/duplicate boxes in the Darknet photos), it's probably what I guessed: Differences in NMS implementation.

Either way, it's clear that the shortcut layer is perfectly implemented now and that this ticket is ready for merge. Deep thanks for all your great work @dkurt!

Arcitec commented 4 years ago

❤️ @alalek

Arcitec commented 4 years ago

For anyone who needs this new feature right now on Python and can't wait 3 months for the next OpenCV version:

I've successfully compiled and packaged a patched Python module, and documented the process here: https://github.com/skvark/opencv-python/issues/254

The guide is duplicated here, for convenience:


This was inspired by needing a brand new feature (https://github.com/opencv/opencv/pull/15739) even though OpenCV only officially releases new versions ~4 times per year. I don't have time to wait for 3 months, so I needed to build a .whl file to distribute to users immediately.

The build process was pretty easy, but complicated at the same time. So this documents the entire process to help others!

Requirements

  1. Python 3 64-bit. Go to https://www.python.org/downloads/windows/ and under the correct heading (currently it's Python 3.7.5), get the Windows x86-64 executable installer. You do NOT want the default 32-bit version! And during the install you MUST choose "Add to PATH".

  2. Install the free Visual Studio 2019 Community Edition, and enable the C++ language package during install.

  3. Install cmake for Windows. Go to https://cmake.org/download/ and get the win64-x64 version.

  4. Install git for Windows. Go to https://git-scm.com/downloads and it should auto-select the 64-bit version. During install you must add Git to Path, and install the git-bash tool (which is a Cygwin linux-like terminal for Windows).

  5. Add cmake and Visual Studio's compiler to your system PATH. Press Windows + i to open Settings, then search for environment and choose "edit the system environment variables". Click on "Environment Variables" in the window that pops up. Then double-click on PATH. Add the correct folders. I had to add these two: A) C:\Program Files\CMake\bin and B) C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\amd64. Without doing this, cmake or the compiler will not be found!

Preparing to Build!

  1. Clone the OpenCV-Python repository and all submodules. This is the repo used for making the semi-official pip (pypi.org) packages for opencv, so you can trust that it will do the right thing.

    git clone --recurse-submodules https://github.com/skvark/opencv-python.git
  2. Put your custom patch in the opencv-python/patches/ folder. In my example, I want https://github.com/opencv/opencv/pull/15739.patch (from https://github.com/opencv/opencv/pull/15739, which adds support for a great new neural network). To do this, it's as easy as starting "Git Bash" and navigating to the opencv-python/patches/ and then typing the following command.

    curl -O "https://patch-diff.githubusercontent.com/raw/opencv/opencv/pull/15739.patch"
  3. Edit setup.py in a text editor and search for "Visual Studio". It will be set to an ancient version for compatibility with old Windows versions, but I don't care about that. Edit the line as follows to make it use Visual Studio 2019 (comment out the old line as seen below, and add two new lines):

        #"-G", "Visual Studio 14" + (" Win64" if x64 else '')
        "-G", "Visual Studio 16 2019",
        "-A", "x64" if x64 else "Win32"
  4. Now it's time to add the patch command. Look for the line that says if 'CMAKE_ARGS' in os.environ:, and ABOVE THAT LINE, insert the following.

    subprocess.check_call("patch -p1 -d opencv < patches/15739.patch", shell=True)

Building!

  1. Now we're going to run the remaining build process as described at https://github.com/skvark/opencv-python#build-process. You must do this step in Git Bash (so that you have support for passing in the ENABLE_CONTRIB variable as seen below), and type the following command to build the Python package.

    ENABLE_CONTRIB=1 python setup.py bdist_wheel
  2. (Optional) If the build above fails and you need to re-build (after fixing whatever was wrong), you must first "clean" the source dir again since the patch modifies the source files and would fail to apply itself the next time. If so, just enter the opencv-python/opencv folder and type the following.

    git reset --hard
  3. On Mac/Linux there seems to be some extra step to fix the package before using it. There's no such step on Windows.

  4. The built package now exists as opencv-python/_skbuild/win-amd64-3.7/setuptools/lib.win-amd64-3.7 (if you want to look at what is inside the wheel), and the wheel file is in opencv-python/dist.

After the Build!

  1. The instructions in this repo say that step 6 is "Rearrange OpenCV's build result, add our custom files and generate wheel", which makes no sense (rearrange what? add what custom files?). Either way, the folder contents are already perfect (they completely match what the pypi package installs). So now we'll simply use the finished wheel file, which is ready for easy distribution...

  2. Navigate to opencv-python/dist, and forcibly install the wheel (so that it overwrites any existing official installation), by typing the following command. The exact name of the wheel file will vary based on version.

    python -m pip install --upgrade --force-reinstall opencv_contrib_python-4.1.1.26-cp37-cp37m-win_amd64.whl
  3. Distribute the .whl file to anyone who needs your custom-built OpenCV, until the pypi package catches up and you can replace it with the "official" version again.

Celebrate.

https://www.youtube.com/watch?v=maAFcEU6atk

mmaaz60 commented 4 years ago


Thanks @VideoPlayerCode for the amazing work. Can you please share the exact CPU model as well? Also, were you using an SSD or HDD? Thanks

WongKinYiu commented 4 years ago

@mmaaz60 Hello,

@VideoPlayerCode uses an i7-8750H: 80 classes, 50.90 FPS (best performance mode, without displaying results).

@WongKinYiu uses the following setups:
i9-9900K: 80 classes, 65.13 FPS (SSD, displaying results)
i7-9750H: 80 classes, 46.46 FPS (SSD, normal mode, displaying results)
i7-7700: 80 classes, 34.83 FPS (HDD, displaying results)
i7-6700: 80 classes, 34.21 FPS (HDD, displaying results)

mmaaz60 commented 4 years ago

Thanks @WongKinYiu ,

It really helps. Just wanted to share that when I used the original Tiny YOLOv3 (not the PRN one) with the OpenCV DNN module and the Inference Engine backend, I got around 39 FPS on an i7-6700 (20 classes, HDD, display off).

So using the Inference Engine backend will further increase the FPS of Tiny YOLOv3-PRN. Will share the benchmarks once done.

Also, it would be awesome if you could repeat the benchmarks with OpenCV DNN and the Inference Engine backend. Thanks

WongKinYiu commented 4 years ago

@mmaaz60

YOLOv3-tiny-PRN: [image: benchmark results]

YOLOv3-tiny: [image: benchmark results]

mmaaz60 commented 4 years ago

Hi @WongKinYiu,

Have you tried using the OpenVINO Inference Engine backend with YOLOv3-tiny-PRN? It would surely improve speed.

Thanks

WongKinYiu commented 4 years ago

@mmaaz60

Not yet, I'm not familiar with OpenVINO.

isgursoy commented 4 years ago

following your magic