opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

DNN inference uses 3.5 times more memory in 4.8.0 compared to 4.5.2 #24134

Open ukoehler opened 9 months ago

ukoehler commented 9 months ago

System Information

OpenCV version: 4.8.0 vs. 4.5.2, compiled from source
Operating system: both Windows and Linux
Compiler: GCC 11

Detailed description

During regression testing between versions 4.8.0 and 4.5.2, most regression tests ran perfectly: all results were the same. However, the runtime with larger datasets increased dramatically, while for smaller datasets I see only a slightly longer runtime. It turned out that swapping was the problem for the larger datasets.

I used valgrind massif to measure memory usage for a single unit test with one of the networks and noticed that 4.5.2 used 553.8 MB of RAM, while version 4.8.0 needed 1.89 GB of RAM for the same task.

Find the network data here: https://drive.google.com/file/d/1hPetNOt76xze8cD3DriDuZira65IeKc4/view?usp=sharing and https://drive.google.com/file/d/1swyevb0xsQhKeFxHOz-G00Dl-mz2wVCD/view?usp=sharing

As requested in issue https://github.com/opencv/opencv/issues/23223, I provide the per-layer performance data (layer name, layer type, time in ticks; there was NO swapping) for 4.8.0:

conv_0 Convolution 2.84259e+07
bn_0 BatchNorm 0
leaky_1 ReLU 0
conv_1 Convolution 6.85735e+07
bn_1 BatchNorm 0
leaky_2 ReLU 0
conv_2 Convolution 1.24519e+07
bn_2 BatchNorm 0
leaky_3 ReLU 0
conv_3 Convolution 2.511e+07
bn_3 BatchNorm 0
leaky_4 ReLU 0
shortcut_4 Eltwise 1.83846e+06
conv_5 Convolution 6.18066e+07
bn_5 BatchNorm 0
leaky_6 ReLU 0
conv_6 Convolution 8.75183e+06
bn_6 BatchNorm 0
leaky_7 ReLU 0
conv_7 Convolution 2.21703e+07
bn_7 BatchNorm 0
leaky_8 ReLU 0
shortcut_8 Eltwise 1.19246e+06
conv_9 Convolution 9.9281e+06
bn_9 BatchNorm 0
leaky_10 ReLU 0
conv_10 Convolution 2.12841e+07
bn_10 BatchNorm 0
leaky_11 ReLU 0
shortcut_11 Eltwise 1.27645e+06
conv_12 Convolution 6.04233e+07
bn_12 BatchNorm 0
leaky_13 ReLU 0
conv_13 Convolution 7.77107e+06
bn_13 BatchNorm 0
leaky_14 ReLU 0
conv_14 Convolution 2.67491e+07
bn_14 BatchNorm 0
leaky_15 ReLU 0
shortcut_15 Eltwise 927776
conv_16 Convolution 8.20111e+06
bn_16 BatchNorm 0
leaky_17 ReLU 0
conv_17 Convolution 2.45839e+07
bn_17 BatchNorm 0
leaky_18 ReLU 0
shortcut_18 Eltwise 905403
conv_19 Convolution 8.05668e+06
bn_19 BatchNorm 0
leaky_20 ReLU 0
conv_20 Convolution 2.45622e+07
bn_20 BatchNorm 0
leaky_21 ReLU 0
shortcut_21 Eltwise 876047
conv_22 Convolution 8.17928e+06
bn_22 BatchNorm 0
leaky_23 ReLU 0
conv_23 Convolution 2.47782e+07
bn_23 BatchNorm 0
leaky_24 ReLU 0
shortcut_24 Eltwise 869224
conv_25 Convolution 8.12968e+06
bn_25 BatchNorm 0
leaky_26 ReLU 0
conv_26 Convolution 2.407e+07
bn_26 BatchNorm 0
leaky_27 ReLU 0
shortcut_27 Eltwise 867241
conv_28 Convolution 9.44835e+06
bn_28 BatchNorm 0
leaky_29 ReLU 0
conv_29 Convolution 2.51462e+07
bn_29 BatchNorm 0
leaky_30 ReLU 0
shortcut_30 Eltwise 861329
conv_31 Convolution 8.20646e+06
bn_31 BatchNorm 0
leaky_32 ReLU 0
conv_32 Convolution 2.53283e+07
bn_32 BatchNorm 0
leaky_33 ReLU 0
shortcut_33 Eltwise 827745
conv_34 Convolution 8.03971e+06
bn_34 BatchNorm 0
leaky_35 ReLU 0
conv_35 Convolution 2.46168e+07
bn_35 BatchNorm 0
leaky_36 ReLU 0
shortcut_36 Eltwise 868062
conv_37 Convolution 6.40316e+07
bn_37 BatchNorm 0
leaky_38 ReLU 0
conv_38 Convolution 7.39178e+06
bn_38 BatchNorm 0
leaky_39 ReLU 0
conv_39 Convolution 5.90104e+07
bn_39 BatchNorm 0
leaky_40 ReLU 0
shortcut_40 Eltwise 627514
conv_41 Convolution 8.2697e+06
bn_41 BatchNorm 0
leaky_42 ReLU 0
conv_42 Convolution 5.19797e+07
bn_42 BatchNorm 0
leaky_43 ReLU 0
shortcut_43 Eltwise 626462
conv_44 Convolution 8.17896e+06
bn_44 BatchNorm 0
leaky_45 ReLU 0
conv_45 Convolution 5.59095e+07
bn_45 BatchNorm 0
leaky_46 ReLU 0
shortcut_46 Eltwise 638976
conv_47 Convolution 8.28036e+06
bn_47 BatchNorm 0
leaky_48 ReLU 0
conv_48 Convolution 5.31428e+07
bn_48 BatchNorm 0
leaky_49 ReLU 0
shortcut_49 Eltwise 628797
conv_50 Convolution 8.21851e+06
bn_50 BatchNorm 0
leaky_51 ReLU 0
conv_51 Convolution 5.88002e+07
bn_51 BatchNorm 0
leaky_52 ReLU 0
shortcut_52 Eltwise 678821
conv_53 Convolution 8.20721e+06
bn_53 BatchNorm 0
leaky_54 ReLU 0
conv_54 Convolution 5.25305e+07
bn_54 BatchNorm 0
leaky_55 ReLU 0
shortcut_55 Eltwise 643926
conv_56 Convolution 8.41081e+06
bn_56 BatchNorm 0
leaky_57 ReLU 0
conv_57 Convolution 5.20968e+07
bn_57 BatchNorm 0
leaky_58 ReLU 0
shortcut_58 Eltwise 637153
conv_59 Convolution 8.26404e+06
bn_59 BatchNorm 0
leaky_60 ReLU 0
conv_60 Convolution 6.02537e+07
bn_60 BatchNorm 0
leaky_61 ReLU 0
shortcut_61 Eltwise 620270
conv_62 Convolution 7.23844e+07
bn_62 BatchNorm 0
leaky_63 ReLU 0
conv_63 Convolution 9.8334e+06
bn_63 BatchNorm 0
leaky_64 ReLU 0
conv_64 Convolution 2.18966e+08
bn_64 BatchNorm 0
leaky_65 ReLU 0
shortcut_65 Eltwise 437773
conv_66 Convolution 9.76036e+06
bn_66 BatchNorm 0
leaky_67 ReLU 0
conv_67 Convolution 2.08698e+08
bn_67 BatchNorm 0
leaky_68 ReLU 0
shortcut_68 Eltwise 433034
conv_69 Convolution 9.82622e+06
bn_69 BatchNorm 0
leaky_70 ReLU 0
conv_70 Convolution 1.76859e+08
bn_70 BatchNorm 0
leaky_71 ReLU 0
shortcut_71 Eltwise 428305
conv_72 Convolution 9.81772e+06
bn_72 BatchNorm 0
leaky_73 ReLU 0
conv_73 Convolution 1.81768e+08
bn_73 BatchNorm 0
leaky_74 ReLU 0
shortcut_74 Eltwise 433105
conv_75 Convolution 1.01164e+07
bn_75 BatchNorm 0
leaky_76 ReLU 0
conv_76 Convolution 1.83666e+08
bn_76 BatchNorm 0
leaky_77 ReLU 0
conv_77 Convolution 9.73596e+06
bn_77 BatchNorm 0
leaky_78 ReLU 0
conv_78 Convolution 1.88721e+08
bn_78 BatchNorm 0
leaky_79 ReLU 0
conv_79 Convolution 9.82813e+06
bn_79 BatchNorm 0
leaky_80 ReLU 0
conv_80 Convolution 1.83166e+08
bn_80 BatchNorm 0
leaky_81 ReLU 0
conv_81 Convolution 5.09009e+06
permute_82 Permute 95973
yolo_82 Region 920492
identity_83 Identity 2254
conv_84 Convolution 2.71254e+06
bn_84 BatchNorm 0
leaky_85 ReLU 0
upsample_85 Resize 836693
concat_86 Concat 272749
conv_87 Convolution 1.14419e+07
bn_87 BatchNorm 0
leaky_88 ReLU 0
conv_88 Convolution 6.20935e+07
bn_88 BatchNorm 0
leaky_89 ReLU 0
conv_89 Convolution 8.19448e+06
bn_89 BatchNorm 0
leaky_90 ReLU 0
conv_90 Convolution 5.79426e+07
bn_90 BatchNorm 0
leaky_91 ReLU 0
conv_91 Convolution 7.96112e+06
bn_91 BatchNorm 0
leaky_92 ReLU 0
conv_92 Convolution 5.80056e+07
bn_92 BatchNorm 0
leaky_93 ReLU 0
conv_93 Convolution 7.66066e+06
permute_94 Permute 195382
yolo_94 Region 3.38031e+06
identity_95 Identity 2866
conv_96 Convolution 2.2153e+06
bn_96 BatchNorm 0
leaky_97 ReLU 0
upsample_97 Resize 739277
concat_98 Concat 548914
conv_99 Convolution 1.13213e+07
bn_99 BatchNorm 0
leaky_100 ReLU 0
conv_100 Convolution 2.55236e+07
bn_100 BatchNorm 0
leaky_101 ReLU 0
conv_101 Convolution 7.85209e+06
bn_101 BatchNorm 0
leaky_102 ReLU 0
conv_102 Convolution 2.45604e+07
bn_102 BatchNorm 0
leaky_103 ReLU 0
conv_103 Convolution 7.7966e+06
bn_103 BatchNorm 0
leaky_104 ReLU 0
conv_104 Convolution 2.83962e+07
bn_104 BatchNorm 0
leaky_105 ReLU 0
conv_105 Convolution 1.36612e+07
permute_106 Permute 1.00528e+06
yolo_106 Region 1.32232e+07

and the same data for 4.5.2:

conv_0 Convolution 2.6347e+07
bn_0 BatchNorm 0
leaky_1 ReLU 0
conv_1 Convolution 7.16753e+07
bn_1 BatchNorm 0
leaky_2 ReLU 0
conv_2 Convolution 1.12384e+07
bn_2 BatchNorm 0
leaky_3 ReLU 0
conv_3 Convolution 7.2381e+07
bn_3 BatchNorm 0
leaky_4 ReLU 0
shortcut_4 Eltwise 3.77401e+06
conv_5 Convolution 6.44327e+07
bn_5 BatchNorm 0
leaky_6 ReLU 0
conv_6 Convolution 9.09992e+06
bn_6 BatchNorm 0
leaky_7 ReLU 0
conv_7 Convolution 6.48448e+07
bn_7 BatchNorm 0
leaky_8 ReLU 0
shortcut_8 Eltwise 1.94067e+06
conv_9 Convolution 9.30628e+06
bn_9 BatchNorm 0
leaky_10 ReLU 0
conv_10 Convolution 6.53347e+07
bn_10 BatchNorm 0
leaky_11 ReLU 0
shortcut_11 Eltwise 1.98969e+06
conv_12 Convolution 6.24834e+07
bn_12 BatchNorm 0
leaky_13 ReLU 0
conv_13 Convolution 7.92278e+06
bn_13 BatchNorm 0
leaky_14 ReLU 0
conv_14 Convolution 6.22588e+07
bn_14 BatchNorm 0
leaky_15 ReLU 0
shortcut_15 Eltwise 1.02496e+06
conv_16 Convolution 7.93264e+06
bn_16 BatchNorm 0
leaky_17 ReLU 0
conv_17 Convolution 6.22617e+07
bn_17 BatchNorm 0
leaky_18 ReLU 0
shortcut_18 Eltwise 1.05038e+06
conv_19 Convolution 7.85361e+06
bn_19 BatchNorm 0
leaky_20 ReLU 0
conv_20 Convolution 6.43121e+07
bn_20 BatchNorm 0
leaky_21 ReLU 0
shortcut_21 Eltwise 1.03558e+06
conv_22 Convolution 7.8001e+06
bn_22 BatchNorm 0
leaky_23 ReLU 0
conv_23 Convolution 6.2343e+07
bn_23 BatchNorm 0
leaky_24 ReLU 0
shortcut_24 Eltwise 1.06368e+06
conv_25 Convolution 8.00235e+06
bn_25 BatchNorm 0
leaky_26 ReLU 0
conv_26 Convolution 6.22367e+07
bn_26 BatchNorm 0
leaky_27 ReLU 0
shortcut_27 Eltwise 1.06686e+06
conv_28 Convolution 7.59016e+06
bn_28 BatchNorm 0
leaky_29 ReLU 0
conv_29 Convolution 6.22161e+07
bn_29 BatchNorm 0
leaky_30 ReLU 0
shortcut_30 Eltwise 1.0338e+06
conv_31 Convolution 7.76942e+06
bn_31 BatchNorm 0
leaky_32 ReLU 0
conv_32 Convolution 6.23441e+07
bn_32 BatchNorm 0
leaky_33 ReLU 0
shortcut_33 Eltwise 1.05267e+06
conv_34 Convolution 7.66518e+06
bn_34 BatchNorm 0
leaky_35 ReLU 0
conv_35 Convolution 6.19013e+07
bn_35 BatchNorm 0
leaky_36 ReLU 0
shortcut_36 Eltwise 1.04865e+06
conv_37 Convolution 6.41796e+07
bn_37 BatchNorm 0
leaky_38 ReLU 0
conv_38 Convolution 7.42708e+06
bn_38 BatchNorm 0
leaky_39 ReLU 0
conv_39 Convolution 6.40619e+07
bn_39 BatchNorm 0
leaky_40 ReLU 0
shortcut_40 Eltwise 624418
conv_41 Convolution 7.63641e+06
bn_41 BatchNorm 0
leaky_42 ReLU 0
conv_42 Convolution 6.43142e+07
bn_42 BatchNorm 0
leaky_43 ReLU 0
shortcut_43 Eltwise 652441
conv_44 Convolution 7.49428e+06
bn_44 BatchNorm 0
leaky_45 ReLU 0
conv_45 Convolution 6.42385e+07
bn_45 BatchNorm 0
leaky_46 ReLU 0
shortcut_46 Eltwise 631932
conv_47 Convolution 7.50422e+06
bn_47 BatchNorm 0
leaky_48 ReLU 0
conv_48 Convolution 6.4686e+07
bn_48 BatchNorm 0
leaky_49 ReLU 0
shortcut_49 Eltwise 597918
conv_50 Convolution 7.67551e+06
bn_50 BatchNorm 0
leaky_51 ReLU 0
conv_51 Convolution 6.40918e+07
bn_51 BatchNorm 0
leaky_52 ReLU 0
shortcut_52 Eltwise 652051
conv_53 Convolution 7.59911e+06
bn_53 BatchNorm 0
leaky_54 ReLU 0
conv_54 Convolution 6.41076e+07
bn_54 BatchNorm 0
leaky_55 ReLU 0
shortcut_55 Eltwise 640980
conv_56 Convolution 7.52332e+06
bn_56 BatchNorm 0
leaky_57 ReLU 0
conv_57 Convolution 6.55695e+07
bn_57 BatchNorm 0
leaky_58 ReLU 0
shortcut_58 Eltwise 714310
conv_59 Convolution 7.76392e+06
bn_59 BatchNorm 0
leaky_60 ReLU 0
conv_60 Convolution 6.40974e+07
bn_60 BatchNorm 0
leaky_61 ReLU 0
shortcut_61 Eltwise 631242
conv_62 Convolution 7.06249e+07
bn_62 BatchNorm 0
leaky_63 ReLU 0
conv_63 Convolution 8.10891e+06
bn_63 BatchNorm 0
leaky_64 ReLU 0
conv_64 Convolution 7.02548e+07
bn_64 BatchNorm 0
leaky_65 ReLU 0
shortcut_65 Eltwise 611314
conv_66 Convolution 1.01592e+07
bn_66 BatchNorm 0
leaky_67 ReLU 0
conv_67 Convolution 7.04279e+07
bn_67 BatchNorm 0
leaky_68 ReLU 0
shortcut_68 Eltwise 464534
conv_69 Convolution 8.1087e+06
bn_69 BatchNorm 0
leaky_70 ReLU 0
conv_70 Convolution 7.0373e+07
bn_70 BatchNorm 0
leaky_71 ReLU 0
shortcut_71 Eltwise 740880
conv_72 Convolution 1.54541e+07
bn_72 BatchNorm 0
leaky_73 ReLU 0
conv_73 Convolution 7.02675e+07
bn_73 BatchNorm 0
leaky_74 ReLU 0
shortcut_74 Eltwise 443173
conv_75 Convolution 8.13352e+06
bn_75 BatchNorm 0
leaky_76 ReLU 0
conv_76 Convolution 7.05095e+07
bn_76 BatchNorm 0
leaky_77 ReLU 0
conv_77 Convolution 8.21635e+06
bn_77 BatchNorm 0
leaky_78 ReLU 0
conv_78 Convolution 7.65501e+07
bn_78 BatchNorm 0
leaky_79 ReLU 0
conv_79 Convolution 1.01681e+07
bn_79 BatchNorm 0
leaky_80 ReLU 0
conv_80 Convolution 7.04393e+07
bn_80 BatchNorm 0
leaky_81 ReLU 0
conv_81 Convolution 4.17549e+06
permute_82 Permute 95582
yolo_82 Region 911274
identity_83 Identity 3156
conv_84 Convolution 2.1118e+06
bn_84 BatchNorm 0
leaky_85 ReLU 0
upsample_85 Resize 1.04758e+06
concat_86 Concat 279501
conv_87 Convolution 1.13795e+07
bn_87 BatchNorm 0
leaky_88 ReLU 0
conv_88 Convolution 6.51143e+07
bn_88 BatchNorm 0
leaky_89 ReLU 0
conv_89 Convolution 7.61145e+06
bn_89 BatchNorm 0
leaky_90 ReLU 0
conv_90 Convolution 6.43658e+07
bn_90 BatchNorm 0
leaky_91 ReLU 0
conv_91 Convolution 7.65415e+06
bn_91 BatchNorm 0
leaky_92 ReLU 0
conv_92 Convolution 6.39612e+07
bn_92 BatchNorm 0
leaky_93 ReLU 0
conv_93 Convolution 7.23372e+06
permute_94 Permute 266236
yolo_94 Region 3.42418e+06
identity_95 Identity 3046
conv_96 Convolution 2.09349e+06
bn_96 BatchNorm 0
leaky_97 ReLU 0
upsample_97 Resize 766659
concat_98 Concat 398589
conv_99 Convolution 1.33242e+07
bn_99 BatchNorm 0
leaky_100 ReLU 0
conv_100 Convolution 6.18128e+07
bn_100 BatchNorm 0
leaky_101 ReLU 0
conv_101 Convolution 7.52835e+06
bn_101 BatchNorm 0
leaky_102 ReLU 0
conv_102 Convolution 6.18159e+07
bn_102 BatchNorm 0
leaky_103 ReLU 0
conv_103 Convolution 7.52287e+06
bn_103 BatchNorm 0
leaky_104 ReLU 0
conv_104 Convolution 6.1976e+07
bn_104 BatchNorm 0
leaky_105 ReLU 0
conv_105 Convolution 1.44498e+07
permute_106 Permute 877501
yolo_106 Region 1.30326e+07

Steps to reproduce

        // Load one test image and convert it into a 416x416 input blob.
        std::string fileName = (*iter);
        cv::Mat image = imreadpng( fileName, cv::IMREAD_UNCHANGED ); // project-specific PNG loader
        cv::Mat blob = cv::dnn::blobFromImage(image, 1/255.0, cv::Size(416, 416), cv::Scalar(0, 0, 0), true, false);
        net.setInput(blob);
        std::vector<cv::Mat> outs;
        net.forward(outs, outputnames);

        // Print the per-layer timings collected during the forward pass.
        std::vector<double> timings;
        net.getPerfProfile(timings);
        std::vector<std::string> names = net.getLayerNames();
        CV_Assert(names.size() == timings.size());
        for (size_t i = 0; i < names.size(); ++i) {
            cv::Ptr<cv::dnn::Layer> l = net.getLayer(net.getLayerId(names[i]));
            std::cout << names[i] << " " << l->type << " " << timings[i] << std::endl;
        }
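
Note that getPerfProfile() reports timings in ticks rather than seconds; to print milliseconds, divide by the tick frequency. A minimal sketch, reusing the names from the snippet above:

        // getPerfProfile() returns tick counts; scale by ticks-per-millisecond.
        const double ticksPerMs = cv::getTickFrequency() / 1000.0;
        for (size_t i = 0; i < names.size(); ++i) {
            std::cout << names[i] << " " << (timings[i] / ticksPerMs) << " ms" << std::endl;
        }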

dkurt commented 9 months ago

Seems like I misunderstood you initially. The speed of the model is OK, but on 4.8.0 it consumes more memory, is that right?

ukoehler commented 9 months ago

Yes, and that might lead to swapping and drastically longer runtimes, as reported in the other two issues.

zihaomu commented 9 months ago

Hi @ukoehler, in 4.8.0 we use the Winograd optimization for 3x3 stride-1 convolutions, which speeds up inference but costs more memory (maybe about 3 times as much). If you don't want this feature, you can use net.enableWinograd(false); to disable it. In addition, for other types of convolution we repack the weight blob first to get a better L1 and L2 cache hit rate, which costs twice the memory.

In conclusion, the memory consumed by 4.8 is at least twice as much as before (4.5), since we keep both the original convolution weights and the re-packed weight blob. In the future we will have full graph optimization and a better memory strategy to optimize this part.
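
For reference, a minimal sketch of opting out of Winograd; the model paths are placeholders, and the call goes on the Net before the first forward():

    // Load the model (placeholder paths) and force the plain OpenCV CPU backend.
    cv::dnn::Net net = cv::dnn::readNet("model.weights", "model.cfg");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);
    net.enableWinograd(false); // skip the Winograd convolution branch
    net.setInput(blob);        // blob prepared with cv::dnn::blobFromImage as above
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());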

opencv-alalek commented 9 months ago

FYI: #22825

ukoehler commented 9 months ago

Hmm, if that was merged Jan 8th, why do I still see such high memory usage?

zihaomu commented 9 months ago

Hmm, if that was merged Jan 8th, why do I still see such high memory usage?

Hi @ukoehler, please check the details; it is not fixed completely. And due to the current dnn memory strategy, it cannot be fixed in 4.x. We will try to fix it completely in 5.x.

ukoehler commented 9 months ago

I did read that the change brought little improvement for some models, and the tests showed it broke others. Then the tests were just disabled.

ukoehler commented 9 months ago

Just ran other tests with non-YOLO models: memory usage increased from 1.352 GB to 3.147 GB, and it looks like it is leaking memory:

4.5.2:

    GB
1.352^        #
     |        #
     |        #                 :::::::::::::::::::::::::::::::@::::::@::::::@
     |        #             :::@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #          :::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #      @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #::::::@::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |        #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |      @@#: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |      @ #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |      @ #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |      @ #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |    ::@ #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |    : @ #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
     |    : @ #: ::: @::::::: :@:: :: :: :: :::: ::::: :: :::: @ :::::@::::::@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   90.58

4.8.0:

    GB
3.147^                                                                       #
     |                                                                     @@#
     |                                                               @@@:@:@@#
     |                                                          @@:@:@ @:@ @@#
     |                                                          @ :@:@ @:@ @@#
     |                                                          @ :@:@ @:@ @@#
     |                                                        :@@ :@:@ @:@ @@#
     |                                                    @::::@@ :@:@ @:@ @@#
     |                                             @@:::::@: ::@@ :@:@ @:@ @@#
     |                                        :::@:@ :: : @: ::@@ :@:@ @:@ @@#
     |             @                     :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |             @           :::@:@:::::@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |             @    ::::@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |          @@@@:::::: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |          @  @::: :: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |         @@  @::: :: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |         @@  @::: :: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |       @:@@  @::: :: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |      @@:@@  @::: :: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
     |    @@@@:@@  @::: :: :@:::: @:@::: :@::@: :@:@ :: : @: ::@@ :@:@ @:@ @@#
   0 +----------------------------------------------------------------------->Gi
     0                                                                   86.42

JulienMaille commented 9 months ago

@ukoehler I'm curious, have you tried disabling Winograd?

ukoehler commented 9 months ago

No, there is no build switch I could find in the documentation. How would I go about it? I am very eager to test.

JulienMaille commented 9 months ago

I thought @zihaomu said you could disable it with net.enableWinograd(false);

zihaomu commented 9 months ago

you could disable it with net.enableWinograd(false);

Right, but after disabling that, it still costs about twice the memory of 4.5.2, since we repack every convolution weight blob. We will try to fix this completely in 5.x.

ukoehler commented 9 months ago

Hmm, found it and did a speed test. net.enableWinograd(false) makes it faster ???? It returns to the 4.5.2 speed. The first memory test is still running.

zihaomu commented 9 months ago

net.enableWinograd(false) makes it faster ???

That depends on your machine platform. For some models, the main bottleneck is memory rather than computation, and Winograd can actually increase the runtime because it needs more memory. Currently the Winograd branch works very well on the ARMv8 platform: based on my tests, it gives about a 30% speed-up for ResNet50 on a Mac M1. For x86 we implement it with AVX/AVX2 instructions, which gives about a 10% speed-up on my AMD 5600X with ResNet50. Speed is a tricky thing; it depends on the specific platform and the specific model you're using.

ukoehler commented 9 months ago

For the currently running test on an AMD EPYC 7302P virtual machine with AVX2 enabled, it slows down from 1.368 s to 4.351 s. In my recent speed tests I saw fluctuations of 18%.

ukoehler commented 9 months ago

net.enableWinograd(false) still leaks memory and therefore uses about twice as much as version 4.5.2. I do not consider this an optimization trade-off, but a serious bug.

    GB
2.213^                                                                       :
     |                                                 @@@:::::::::::::::@@:#:
     |                                                 @  ::  ::  :  ::  @ :#:
     |                                                 @  ::  ::  :  ::  @ :#:
     |                                                 @  ::  ::  :  ::  @ :#:
     |                                                 @  ::  ::  :  ::  @ :#:
     |            @                           :::::::@@@  ::  ::  :  ::  @ :#:
     |            @               :@@:::::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |            @         :::::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |            @    :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |            @:::::::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |         @@@@: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |         @  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |        @@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |        @@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |        @@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |       :@@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |      @:@@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |    @:@:@@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
     |    @:@:@@  @: : :::::: :::::@ :: ::@::::: ::::@ @  ::  ::  :  ::  @ :#:
   0 +----------------------------------------------------------------------->Gi
     0                                                                   94.28

asmorkalov commented 9 months ago

@zihaomu could you check memory leaks?

ukoehler commented 9 months ago

@zihaomu find the code and models in this issue: https://github.com/opencv/opencv/issues/23982 or even better here: https://github.com/opencv/opencv/issues/24041

I am currently checking the YOLOv3 model with net.enableWinograd(false), but valgrind takes ages ...

zihaomu commented 9 months ago

Hi @asmorkalov, I will work on it.

ukoehler commented 9 months ago

@zihaomu: Here is the RAM graph for the YOLOv3 code shown above with net.enableWinograd(false):

    MB
793.6^                     :
     |                  :##:::@:::::::::::::::::::::@@:::::::::::::::@::::::::
     |                 @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |                 @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |               ::@:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |            :::: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |       ::@@@: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |   ::::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |   : ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |   : ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |  @: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |  @: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |  @: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     |  @: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     | :@: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     | :@: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     | :@: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     | :@: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     | :@: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
     | :@: ::: @ @: :: @:# :::@::: :: :: :: :: :: ::@ :: :: :: : ::: @: :: :::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   278.1

As a hint: in this case I ran several images through the same network, whereas in the graphs shown before I ran one image through 4 different networks. It appears that memory is not cleaned up after an inference run (as was the case for version 4.5.2); it is only released when the network is destructed or another inference is triggered. That means that using valgrind to find the problem will not help.

I will run the above test case with Winograd enabled, but that will take a couple of hours.
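
For illustration: since the memory is only released when the Net is destructed, one way to cap the peak usage when several nets run one after the other is to scope each Net so it is destroyed after its inference run, at the cost of reloading the model from disk each time. A minimal sketch, with placeholder paths and an input blob assumed to be prepared already:

    std::vector<std::string> modelPaths = { "net1.onnx", "net2.onnx" }; // placeholders
    for (const std::string& path : modelPaths)
    {
        cv::dnn::Net net = cv::dnn::readNet(path); // deliberately loaded inside the loop
        net.setInput(blob);
        std::vector<cv::Mat> outs;
        net.forward(outs, net.getUnconnectedOutLayersNames());
        // net goes out of scope here, releasing its re-packed weights and buffers
    }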

zihaomu commented 9 months ago

Hi @ukoehler, how can I reproduce your issue? What I'm doing now is running the following code for an hour to see whether the memory usage increases.

Test environment: Intel Mac, i9.

    // The four nets are assumed to be loaded and their inputs set beforehand.
    int i = 0;
    while (true)
    {
        std::cout << "forward i = " << i << std::endl;
        std::vector<cv::Mat> ret;

        alexnetNet.forward(ret);
        std::cout << ret.size() << "  " << ret[0].rows << "  " << ret[0].cols << std::endl;

        googlenetNet.forward(ret);
        std::cout << ret.size() << "  " << ret[0].rows << "  " << ret[0].cols << std::endl;

        resnet152Net.forward(ret);
        std::cout << ret.size() << "  " << ret[0].rows << "  " << ret[0].cols << std::endl;

        vgg16Net.forward(ret);
        std::cout << ret.size() << "  " << ret[0].rows << "  " << ret[0].cols << std::endl;
        i++;
    }

ukoehler commented 9 months ago

Hi @zihaomu ,

I do not think the while loop is necessary. I would only run through it once and watch the accumulating memory usage. It looks like memory is not cleaned up after running forward(), but only at destruction and before the next forward() run.

I use valgrind -v --tool=massif executable_name on Linux. That will create a massif.out.<pid> file that can be viewed by calling ms_print massif_file. It looks like your platform should be supported: https://valgrind.org/info/platforms.html

zihaomu commented 9 months ago

It looks like memory is not cleaned up after running forward()

Hi @ukoehler, in OpenCV dnn we allocate memory for every layer on the first run (for convolution weight re-packing and Winograd initialization). So I think the accumulating memory usage on the first run is expected; it should not be a memory leak.

If the memory usage is still increasing during the following forward runs, that would be a memory leak.

We cannot do the convolution weight re-packing and Winograd initialization at the beginning of every forward and deallocate the memory at the end; they are quite time-consuming. On my Mac M1, the first run of ResNet50 takes about 50 ms and the following runs take about 25 ms.

For a big model it does not make sense to run inference on the CPU alone, and this memory increase is there to speed up forward. For small models on ARM we can get about a 2x speed-up, tested with ResNet50.
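
To separate this one-time initialization cost from the steady-state cost, the first and second forward passes can be timed individually, e.g. with cv::TickMeter (a sketch; the net and its input are assumed to be set up already):

    cv::TickMeter tm;
    tm.start();
    net.forward();  // first run: re-packs weights and initializes Winograd buffers
    tm.stop();
    std::cout << "first forward:  " << tm.getTimeMilli() << " ms" << std::endl;

    tm.reset();
    tm.start();
    net.forward();  // following runs reuse the already allocated buffers
    tm.stop();
    std::cout << "second forward: " << tm.getTimeMilli() << " ms" << std::endl;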

ukoehler commented 9 months ago

As you can see from the graphs, that behaviour changed from version 4.5.2, and the drastic increase in memory usage poses a big problem. The memory usage does not increase when running the same net again, but accumulates when running different nets one after the other. It also takes too long to reload the nets from disk every time. Something must have changed to cause this memory build-up. Memory should be freed at the end of a forward, or at least the possibility should exist.

As for speed: The changes make all the nets I tested dramatically slower, not faster. For this regression alone, Winograd should not be on by default. Add swapping due to massively increased memory usage (not in any of the cases above; these are just our unit tests) and we would have to seriously think about replacing OpenCV altogether (especially in light of the other, obscure bugs I found).

zihaomu commented 9 months ago

but accumulates when running different nets one after the other.

That's correct. We have found this issue and will try to fix it in 5.x; that needs a refactoring of our whole memory allocation strategy, so there is a lot of work to do.

As for speed: The changes make all the nets I tested dramatically slower, not faster.

Have you tested on an ARM platform?

JulienMaille commented 9 months ago

@ukoehler are your findings specific to a given backend, or do they apply to all of them?

ukoehler commented 9 months ago

@zihaomu , sorry for not specifying here:

        // We target only the plain OpenCV CPU backend for inference.
        googlenetNet.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
        googlenetNet.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

This is the only platform we target for inference, sorry.

zihaomu commented 9 months ago

Finished the one-hour net.forward() run, and there was no memory increase.

Hi @ukoehler, @asmorkalov. After running forward on these models about 10000 times on the CPU, I got the following images. In the end the memory usage is about 2048 MB, as you can see in the images. I think we can remove the bug tag.

[Five memory-usage screenshots attached.]

ukoehler commented 9 months ago

Again, this is still a bug. Your test is not testing the problem at all. The memory is wasted when running several nets for the first time, not when running one net many times.

zihaomu commented 9 months ago

Hi @asmorkalov, please take a look. My reply is https://github.com/opencv/opencv/issues/24134#issuecomment-1674474590.

The memory is wasted when running several nets for the first time, not when running one net many times.

That's true; we will try to fix it in the future. From my point of view, this is an optimization issue, not a memory leak.

ukoehler commented 9 months ago

Keeping the memory might be fine when only dealing with one network. I am dealing with six nets now, and that number will likely increase. The old behaviour worked fine for that; the new behaviour leads to swapping due to the large amount of memory used. Is there a way to at least release memory and restore the old behaviour?

All in all, this is a massive regression without any warning in the release notes. I am clearly not the only one having the problem, as the two other issues about slower inference prove. It looks like the changes have a negative effect in many cases. So far I am just glad that our unit test caught the problem early.

zihaomu commented 9 months ago

The new behaviour leads to swapping due to the large amount of memory used.

Hi @ukoehler, currently we do not have such a flag. In the future we would like to add a new API like net.setMemoryUsage(DNN_MEMORY_LOW), so that users can choose the memory usage level in their code.

So far I am just glad that our unit test caught the problem early.

That's a good idea.

ukoehler commented 9 months ago

net.setMemoryUsage(DNN_MEMORY_LOW) sounds right, with that possibly being the default option. We would need that solution rather quickly. For the time being we can stick with version 4.5.2, but there is one net I am training that requires at least 4.7 to load.

asmorkalov commented 2 months ago

Significant memory footprint improvement: https://github.com/opencv/opencv/pull/25163

dkurt commented 2 months ago

@asmorkalov, that is related to the import stage only, but I will probably take a look at inference too.

asmorkalov commented 2 months ago

Yes, I know. The significant memory consumption is caused by the Winograd branch in convolution.