ukoehler opened this issue 9 months ago
Seems like I misunderstood you initially. The speed of the model is OK, but on 4.8.0 it consumes more memory, is that right?
Yes and that might lead to swapping and drastically longer runtimes as reported in the other 2 issues.
Hi @ukoehler, in 4.8.0 we use the Winograd optimization for 3x3 stride-1 convolutions, which speeds up inference but costs more memory (maybe about 3 times as much). If you don't want this feature, you can disable it with `net.enableWinograd(false);`. In addition, for other types of convolution we repack the weight blob first to get a better L1 and L2 cache hit rate, which doubles the memory used for those weights.
In conclusion, 4.8 consumes at least twice as much memory as 4.5, since we keep both the original convolution weights and the repacked weight blob. In the future we will have full-graph optimization and a better memory strategy to improve this part.
FYI: #22825
Hmm, if that was merged Jan 8th, why do I still see such a high memory usage?
Hi @ukoehler, please check the details; it is not fixed completely. And due to the current dnn memory strategy, it cannot be fixed in 4.x. We will try to fix it completely in 5.x.
I did read that the change brought little improvement for some models, and the tests showed it broke others. Then the tests were just disabled.
Just ran other tests with non-YOLO models: memory usage increased from 1.352 GB to 3.147 GB and it looks like it is leaking memory:
4.5.2:

[valgrind massif graph: memory rises to a 1.352 GB plateau and stays flat for the rest of the run (~90.58 Gi instructions)]
4.8.0:

[valgrind massif graph: memory climbs in steps to a 3.147 GB peak over the run (~86.42 Gi instructions)]
@ukoehler I'm curious, have you tried disabling Winograd?
No. There is no build switch I could find in the documentation. How would I go about it? Very eager to test.
I thought @zihaomu said you could disable it with `net.enableWinograd(false);`
> you could disable it with `net.enableWinograd(false);`

Right, but after disabling that, it still costs about twice the memory of 4.5.2, since we repack every convolution weight blob. We will try to fix it completely in 5.x.
Hmm, found it and did a speed test. `net.enableWinograd(false)` makes it faster ???? It returns to the 4.5.2 speed. The first memory test is still running.
> `net.enableWinograd(false)` makes it faster ????

That depends on your machine platform. For some models, the main bottleneck is memory rather than computation, and Winograd can indeed make them slower because it needs more memory. Currently, the Winograd branch works very well on the ARMv8 platform. Based on my tests, it gives about a 30% speed-up on a Mac M1 with Resnet50. For x86 we do implement the AVX/AVX2 instructions for it, which gives about a 10% speed-up on my AMD 5600X with Resnet50.

Speed is a tricky thing; it depends on your specific platform and the specific model you're using.
For the currently running test on an AMD EPYC 7302P virtual machine with AVX2 enabled, it slows down from 1.368 s to 4.351 s. In my recent speed tests I saw fluctuations of 18%.
`net.enableWinograd(false)` still leaks memory and therefore uses about twice as much as version 4.5.2. I do not consider this an optimization trade-off, but a serious bug.
[valgrind massif graph with Winograd disabled: memory climbs in steps to a 2.213 GB peak (~94.28 Gi instructions)]
@zihaomu could you check memory leaks?
@zihaomu: you can find the code and models in this issue: https://github.com/opencv/opencv/issues/23982, or even better here: https://github.com/opencv/opencv/issues/24041
I am currently checking the YOLO3 model with net.enableWinograd(false), but valgrind takes ages ...
Hi @asmorkalov, I will work on it.
@zihaomu: Here is the RAM graph for the YOLO3 code shown above with net.enableWinograd(false):
[valgrind massif graph for YOLO3 with Winograd disabled: memory rises to a 793.6 MB plateau and stays flat (~278.1 Gi instructions)]
As a hint: in this case I ran several images through the same network. In the graphs shown before, I ran one image through 4 networks. It appears that memory is not cleaned up after an inference run (as it was in version 4.5.2). The memory is cleared when the network is destructed or another inference is triggered. That means that using valgrind to find the problem will not help.
I will run the above test case with Winograd enabled, but that will take a couple of hours.
Hi @ukoehler, how can I reproduce your issue? What I'm doing now is running the following code for an hour and watching whether the memory keeps increasing.
Test environment: Mac Intel, i9.
```cpp
// The four nets are loaded and their inputs are set before this loop.
int i = 0;
while (true)
{
    std::cout << "forward i = " << i << std::endl;
    std::vector<cv::Mat> ret;
    alexnetNet.forward(ret);
    std::cout << ret.size() << " " << ret[0].rows << " " << ret[0].cols << std::endl;
    googlenetNet.forward(ret);
    std::cout << ret.size() << " " << ret[0].rows << " " << ret[0].cols << std::endl;
    resnet152Net.forward(ret);
    std::cout << ret.size() << " " << ret[0].rows << " " << ret[0].cols << std::endl;
    vgg16Net.forward(ret);
    std::cout << ret.size() << " " << ret[0].rows << " " << ret[0].cols << std::endl;
    i++;
}
```
Hi @zihaomu ,
I do not think the while loop is necessary. I would only run through it once and watch the accumulating memory usage. It looks like memory is not cleaned after running forward(), but cleaned at destruction and before the next forward() run.
I use `valgrind -v --tool=massif executable_name` on Linux. That will create a `massif.out.<pid>` file that can be viewed by calling `ms_print massif_file`. It looks like your platform should be supported: https://valgrind.org/info/platforms.html
> Looks like memory is not cleaned after running forward()
Hi @ukoehler, in opencv dnn we allocate memory layer by layer on the first run (for convolution weight re-packing and Winograd initialization). So I think the accumulating memory usage on the first run is expected; it should not be a memory leak. If the memory usage is still increasing during the following `forward` calls, that would be a memory leak.
We cannot do the convolution weight re-packing and Winograd initialization at the beginning of every `forward` and deallocate the memory at the end of `forward`; they are quite time-consuming. On my Mac M1 chip, the first run of Resnet50 takes about 50 ms and the following runs take about 25 ms.
For a big model, it does not make sense to forward only with the CPU. And this memory increase exists to speed up `forward`. For small models on ARM we can get about a 2x speed-up, tested with Resnet50.
As you can see from the graphs, that behaviour changed from version 4.5.2. The drastic increase in memory usage poses a big problem. The memory usage does not increase when running the same net again, but it accumulates when running different nets one after the other. It also takes too long to reload the nets from disk every time. Something must have changed that causes that memory build-up. Memory should be freed at the end of a forward, or the possibility to free it should at least exist.
As for speed: the changes make all the nets I tested dramatically slower, not faster. For this regression alone, Winograd should not be on by default. Add swapping due to the massively increased memory usage (not in any of the cases above; these are just our unit tests) and we would have to seriously think about replacing OpenCV altogether (especially in light of the other, obscure bugs I found).
> but is accumulating when running different nets one after the other.

That's correct. We have found this issue and will try to fix it in 5.x, which requires refactoring our whole memory allocation strategy; a lot of work to do.

> As for speed: The changes make all the nets I tested dramatically slower, not faster.

Have you tested on an ARM platform?
@ukoehler, are your findings specific to a given backend, or do they apply to all of them?
@zihaomu , sorry for not specifying here:
googlenetNet.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
googlenetNet.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);
This is the only platform we target for inference, sorry.
Hi @ukoehler, @asmorkalov. After running `net.forward()` on these models about 10000 times on CPU, the memory did not keep increasing; I got the following image. In the end the memory usage is about 2048 MB, as you can find in the image. I think we can remove the `bug` tag.
Again, still a bug. Your test is not testing the problem at all. The memory is wasted while running several nets for the first time, not one net lots of times.
Hi @asmorkalov, please take a look. The reply from my side is https://github.com/opencv/opencv/issues/24134#issuecomment-1674474590.
> The memory is wasted while running several nets the first time not one net lots of times.

That's true. We will try to fix it in the future. From my point of view, this is an optimization issue, not a memory leak.
Keeping the memory might be fine when only dealing with one network. I am dealing with six nets now, and that number will likely increase. The old behaviour worked fine for that; the new behaviour leads to swapping due to the large amount of memory used. Is there a way to at least release memory and restore the old behaviour?
All in all, this is a massive regression without warning in the release notes. I am clearly not the only one having the problem, as the two other issues about slower inference prove. It looks like the changes have a negative effect in many cases. So far I am just glad that our unit tests caught the problem early.
> The new behaviour leads to swapping due to the large amount of memory used.

Hi @ukoehler, currently we do not have such a flag. In the future we would like to add a new API like `net.setMemoryUsage(DNN_MEMORY_LOW)`, so that users can choose the memory usage level in their code.
> So far I am just glad that our unit test caught the problem early.

That's a good idea.
> we would like to add a new API like `net.setMemoryUsage(DNN_MEMORY_LOW)`

That sounds right, with that possibly being the default option. We would need that solution rather quickly. For the time being we can stick to version 4.5.2, but there is one net I am training that requires at least 4.7 to load.
Significant memory footprint improvement: https://github.com/opencv/opencv/pull/25163
@asmorkalov, that is related to the import stage only, but I will probably take a look at inference too.
Yes, I know. The significant memory consumption is caused by the Winograd branch in convolution.
System Information
OpenCV version: 4.8.0 vs. 4.5.2, compiled from source
Operating system: both Windows and Linux
Compiler: GCC 11
Detailed description
During regression testing between versions 4.8.0 and 4.5.2, most regression tests ran perfectly. All results were the same; however, the runtime with larger data sets increased dramatically. For smaller data sets I maybe see a slightly longer runtime. It turns out that swapping was the problem for the larger data sets.
I used valgrind massif to measure memory usage for a singled-out unit test with one of the networks and noticed that 4.5.2 used 553.8 MB of RAM while version 4.8.0 needed 1.89 GB of RAM for the same task.
Find the network data here: https://drive.google.com/file/d/1hPetNOt76xze8cD3DriDuZira65IeKc4/view?usp=sharing and https://drive.google.com/file/d/1swyevb0xsQhKeFxHOz-G00Dl-mz2wVCD/view?usp=sharing
As requested in issue https://github.com/opencv/opencv/issues/23223, I provide the performance data (there was NO SWAPPING) for 4.8.0 and 4.5.2.
Steps to reproduce
Issue submission checklist