openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: IR Models with dynamic dimensions continuously increase memory during inference #23316

Closed. rememberBr closed this issue 5 months ago

rememberBr commented 8 months ago

OpenVINO Version

2023.2, 2022.1, 2023.3, 2024.0

Operating System

Windows System

Device used for inference

CPU

Framework

None

Model used

No response

Issue description

I have an ONNX model with dynamic dimensions. I used the mo tool to convert it to IR format with the following commands:

1) mo --input_model cnn.onnx --output_dir cnnd32
2) mo --input_model cnn.onnx --output_dir cnnd16 --compress_to_fp16
3) mo --input_model cnn.onnx --output_dir cnns32 --compress_to_fp16 --input_shape (1, 1, 64, 64) --static_shape

When using the cnnd32 and cnnd16 models for inference and continuously feeding images of different sizes during repeated inference, the memory usage grows higher and higher (4,000 inferences can reach several GB).

If images of the same size are continuously fed to cnnd32, cnnd16, and cnns32 during repeated inference, memory usage stays at approximately 40 MB.
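For reference, a minimal C++ sketch of the scenario described above; the model path, size range, and iteration count are illustrative placeholders rather than taken from the attached reproducer:

```cpp
#include <openvino/openvino.hpp>
#include <random>

int main() {
    ov::Core core;
    // "cnnd32/cnn.xml" stands in for the dynamic-shape IR produced by mo.
    ov::CompiledModel compiled = core.compile_model("cnnd32/cnn.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    std::mt19937 rng{42};
    std::uniform_int_distribution<int> dist(30, 200);

    for (int i = 0; i < 40000; ++i) {
        const size_t h = static_cast<size_t>(dist(rng));
        const size_t w = static_cast<size_t>(dist(rng));
        // A (mostly) new input shape on every iteration: this is the access
        // pattern under which memory usage keeps growing. The tensor data is
        // left uninitialized; only the shape matters for this memory test.
        ov::Tensor input(ov::element::f32, ov::Shape{1, 1, h, w});
        request.set_input_tensor(input);
        request.infer();
    }
    return 0;
}
```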

The following mo versions were used to convert the model:

1) Version of Model Optimizer is: 2024.0.0-14509-34caeefd078-releases/2024/0
2) Version of Model Optimizer is: 2023.1.0-12185-9e6b00e51cd-releases/2023/1

The following OpenVINO runtime versions were used: 2023.2, 2022.1, 2023.3, 2024.0, all of them downloaded from https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html

The heap analysis results obtained with Visual Studio 2019: [screenshot attached]

My code and model: main.zip

After checking previous issues, I found similar reports. They say these bugs have been fixed, but even with the latest version 2024.0 the problem still exists. What can I do?

Step-by-step reproduction

No response

Relevant log output

No response


vurusovs commented 8 months ago

@rememberBr hello! It is expected that memory grows with dynamic shapes, because different input sizes require different (reallocated) amounts of memory. However, there should be a limit after which no new memory is required to handle all the shapes. Could you limit the image size range, e.g. HxW to [200-220] x [200-220]? It would help to determine whether there is a continuous memory leak or not.

Could you also share the model so we can check it on our side?

In addition, you may take a look at issue https://github.com/openvinotoolkit/openvino/issues/20633, which should be fixed soon.

rememberBr commented 8 months ago

@vurusovs

Greetings, the minimum input size I can handle is 30x30, while the maximum allowed size is 200x200. The test images I work with also fall within this range. Although a slight increase in memory usage may be acceptable, experiencing a jump from 40MB to 3GB in just two minutes poses significant risks.

It appears that the issue lies in reallocating memory when the new input shape does not match the previous one, without freeing up the previous memory allocation.

You can find my model and test code stored in main.zip under "Step-by-step reproduction" at the following link: https://github.com/openvinotoolkit/openvino/files/14519919/main.zip

Thank you for your assistance

utorik45 commented 5 months ago

Any update? I am in the same situation.

v-Golubev commented 5 months ago

@rememberBr @utorik45 hello, we have recently merged a memory leak fix into the master branch. Could you please try the master branch starting from commit b886fa5? The fix will also be included in the 2024.2 OpenVINO release.

rememberBr commented 5 months ago

@rememberBr @utorik45 hello, we have recently merged a memory leak fix into the master branch. Could you please try the master branch starting from commit b886fa5? The fix will also be included in the 2024.2 OpenVINO release.

That's great! If I want to try the master branch, do I need to compile OpenVINO.dll from the source code? And when will version 2024.2 be released?

utorik45 commented 5 months ago

@rememberBr @utorik45 hello, we have recently merged a memory leak fix into the master branch. Could you please try the master branch starting from commit b886fa5? The fix will also be included in the 2024.2 OpenVINO release.

That's great! If I want to try the master branch, do I need to compile OpenVINO.dll from the source code? And when will version 2024.2 be released?

I compiled the master branch of OpenVINO and used the resulting OpenVINO.dll files.

utorik45 commented 5 months ago

I tested the master branch of the current OpenVINO, but the memory still increases during inference with a dynamic shape model (Int8, FP16). @v-Golubev, how did your test of the master branch go?

v-Golubev commented 5 months ago

That's great! If I want to try the master branch, do I need to compile OpenVINO.dll from the source code?

@rememberBr Yes, the only option for now is to compile OpenVINO from the source code.

And when will version 2024.2 be released?

@moslex could you please help to answer? Thanks.

v-Golubev commented 5 months ago

I tried the cnn.onnx model on the current master in the following scenario:

  • the model is compiled with dynamic shapes
  • height and width sizes were generated randomly in the [30-200] range

I got the following memory consumption:

[memory consumption chart: screenshot attached]

As we can see, memory consumption stops growing from roughly the 10,000th inference onwards: no continuous memory leak is visible.

rememberBr commented 5 months ago

I tried the cnn.onnx model on the current master in the following scenario:

  • the model is compiled with dynamic shapes
  • height and width sizes were generated randomly in the [30-200] range

I got the following memory consumption:

[memory consumption chart: screenshot attached]

As we can see, memory consumption stops growing from roughly the 10,000th inference onwards: no continuous memory leak is visible.

I have compiled the latest DLL from the master branch, using version 2024.0 of the header files. The test results have left me somewhat perplexed. When I set the random resize range to [50, 60], the memory increased to 600MB within a short period and then ceased growing. Setting the range to [30, 200] resulted in a memory increase to 1800MB before halting. Further setting it to [30, 400] led to a memory increase of up to 3.2GB in 40,000 tests (which I did not continue). The memory growth curve during testing closely resembled yours: rapid initial growth followed by slower increments later on.

Based on my testing, it appears that the extent of memory increase is positively correlated with the range of input image sizes. I am uncertain whether there may be an issue with my test code or with the DLL I compiled. Upon release of version 2024.2.0, I intend to conduct further testing.

v-Golubev commented 5 months ago

@rememberBr the behavior you described is expected. The positive correlation between the input image size range and memory consumption has at least one more cause besides memory reallocation: primitive caching.

Some primitives used during inference are not shape-agnostic, which means they need to be recompiled every time the input shapes change. Since recompilation is quite expensive from a performance perspective, the CPU plugin caches the compiled primitives. The cached primitives obviously contribute to memory consumption, but they positively impact inference performance.

Users can regulate the primitive cache capacity via the CPU_RUNTIME_CACHE_CAPACITY property. For example, primitive caching can be disabled by setting the capacity to 0:

core.set_property(device_name, {{"CPU_RUNTIME_CACHE_CAPACITY", "0"}})
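As a fuller sketch of the same idea (the model path is a placeholder), note that the property must be applied before compile_model, otherwise it has no effect, as also observed later in this thread:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // Internal CPU plugin property: 0 disables the runtime primitive cache.
    // It must be set before compile_model to take effect.
    core.set_property("CPU", {{"CPU_RUNTIME_CACHE_CAPACITY", "0"}});

    // "cnn.xml" stands in for the dynamic-shape IR discussed in this issue.
    ov::CompiledModel compiled = core.compile_model("cnn.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();
    // ... run inference as usual; peak memory should stay much lower,
    // at the cost of recompiling primitives for every new input shape.
    return 0;
}
```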

This can be done if memory consumption is crucial in your scenario, but please note that it negatively affects performance in most cases. With caching disabled, the cnn.onnx model shows much lower memory consumption. The results for the [30-200] range are below:

[memory consumption chart with caching disabled: screenshot attached]

rememberBr commented 5 months ago

@v-Golubev Thank you for your response. I understand what you mean now. I noticed that you mentioned "Users can regulate the primitive cache capacity via the CPU_RUNTIME_CACHE_CAPACITY property." And when CPU_RUNTIME_CACHE_CAPACITY is set to 0, memory growth is almost completely stifled in my tests as well. I would like to ask whether CPU_RUNTIME_CACHE_CAPACITY can be set to other values, in units of MB or GB, to limit the maximum memory growth? I couldn't find related documentation, but when I tried setting CPU_RUNTIME_CACHE_CAPACITY to 300, memory usage continued to grow beyond 300 MB.

v-Golubev commented 5 months ago

@rememberBr CPU_RUNTIME_CACHE_CAPACITY is an internal CPU plugin property that defines how many records can be stored in the CPU runtime parameters cache per execution stream. It is documented only in the code. Please note: since the property is internal, backward compatibility is not guaranteed; however, it can easily be used in current OV versions.

I would like to ask whether CPU_RUNTIME_CACHE_CAPACITY can be set to other values, in units of MB or GB, to limit the maximum memory growth?

Unfortunately, it is impossible to regulate the CPU runtime cache capacity based on a maximum allowed memory consumption. As I wrote before, we can regulate only the maximum number of records that can be stored in the cache. The default capacity is 5000; with this default I got ~900 MB max memory consumption in my previous experiments. When I set the capacity to 0 (the cache is disabled), I got ~180 MB. So the optimal property value for your specific scenario can be found experimentally.
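A short sketch of such experimental tuning, assuming the same ov::Core object as in the snippet above; the value 1000 is only an example, matching one of the values tried later in this thread:

```cpp
// Keep the cache but bound it: fewer records per execution stream means lower
// peak memory at the cost of more frequent primitive recompilation.
core.set_property("CPU", {{"CPU_RUNTIME_CACHE_CAPACITY", "1000"}});
```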

rememberBr commented 5 months ago

@v-Golubev Does the capacity of CPU_RUNTIME_CACHE_CAPACITY approximately correspond to the number of shape information entries? For example, when my image sizes range between 80-100, there are a total of 400 different possible shapes. Does this mean that at most 400 additional cache entries will be generated in this case? If the allowed range is exceeded, does the system stop caching, or does it discard the earliest cache entries to make room for new ones?

In my tests, setting CPU_RUNTIME_CACHE_CAPACITY to any value other than 0 seems to make no difference; it results in almost the same memory consumption. For quick testing, I randomly scaled images to sizes between 80-100, then compared the effects of setting CPU_RUNTIME_CACHE_CAPACITY to 200 versus not setting it at all. The final memory consumption was almost identical in both cases.

v-Golubev commented 5 months ago

Does the capacity of CPU_RUNTIME_CACHE_CAPACITY approximately correspond to the number of shape information entries? For example, when my image sizes range between 80-100, there are a total of 400 different possible shapes. Does this mean that at most 400 additional cache entries will be generated in this case?

The CPU primitive cache stores primitives that correspond to certain layers (for example, Convolutions and MaxPools in your model). So the maximum number of generated cache entries in your example will be 400 * N, where N is the number of unique primitives in the execution graph. Please also note that N might be less than the total number of Convs/Poolings in the model, because if several layers have the same configuration (input shapes and layer attributes), they reuse one cache entry.
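To make the formula concrete (the primitive count N here is purely illustrative): if height and width each take the 21 integer values in [80, 100], there are 21 * 21 = 441 distinct input shapes, so an execution graph with, say, N = 5 unique cacheable primitives could produce up to 441 * 5 = 2205 cache entries.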

If the allowed range is exceeded, does the system stop caching, or does it discard the earliest cache entries to make room for new ones?

When capacity is exceeded, the cache discards the cache entries that have not been used the longest to free space for new entries.

In my tests, setting CPU_RUNTIME_CACHE_CAPACITY to any value other than 0 seems to make no difference; it results in almost the same memory consumption. For quick testing, I randomly scaled images to sizes between 80-100, then compared the effects of setting CPU_RUNTIME_CACHE_CAPACITY to 200 versus not setting it at all. The final memory consumption was almost identical in both cases.

Is it possible to provide the specific data that you got?

One more point: I noticed that you use only one infer request in your reproducer. Do I understand correctly that your scenario is latency oriented? If yes, it is better to compile the model with PERFORMANCE_HINT=LATENCY (additional details can be found in the link I shared); this can help with memory consumption as well. By default, the model is compiled in throughput mode.
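For illustration, a minimal self-contained sketch of latency-oriented compilation in the C++ API (the model path is a placeholder):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("cnn.xml");  // placeholder path
    // Latency hint: a single execution stream, which typically also lowers
    // the memory footprint compared to a throughput-oriented configuration.
    ov::CompiledModel compiled = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));
    ov::InferRequest request = compiled.create_infer_request();
    return 0;
}
```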

rememberBr commented 5 months ago

@v-Golubev Thank you very much for your answer and the advice given at the end.

The CPU primitive cache stores primitives that correspond to certain layers (for example, Convolutions and MaxPools in your model). So the maximum number of generated cache entries in your example will be 400 * N, where N is the number of unique primitives in the execution graph. Please also note that N might be less than the total number of Convs/Poolings in the model, because if several layers have the same configuration (input shapes and layer attributes), they reuse one cache entry.

If CPU_RUNTIME_CACHE_CAPACITY also involves the primitive cache of the model itself, it seems that it cannot be set simply based on the number of possible shapes, but requires multiple tests to obtain an empirical value?

Is it possible to provide the specific data that you got?

I apologize for the confusion regarding this test. It turns out that core.set_property should be called before core.compile_model. I hadn't done this before, so the setting did not take effect, which is likely why the memory usage results were the same. With the corrected setup, when CPU_RUNTIME_CACHE_CAPACITY is 200, after 4000 inferences the Windows 10 Task Manager shows that the compiled infer.exe ultimately consumes 146 MB of memory. Under the same conditions, with CPU_RUNTIME_CACHE_CAPACITY set to 0 it consumes 394 MB, and with 1000 it consumes 206 MB. It appears that the CPU_RUNTIME_CACHE_CAPACITY setting is effective. I will close this issue and test again with the upcoming 2024.2 OpenVINO release.

Thank you again for your help and I wish you all the best.

v-Golubev commented 5 months ago

@rememberBr You are welcome :)

If CPU_RUNTIME_CACHE_CAPACITY also involves the primitive cache of the model itself, it seems that it cannot be set simply based on the number of possible shapes, but requires multiple tests to obtain an empirical value?

That's right.