openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: Endless memory usage growth with dynamic shapes #20633

Open notaz opened 10 months ago

notaz commented 10 months ago

OpenVINO Version

2023.1.0

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

None

Model used

Mask R-CNN

Issue description

We need to run inference with a custom model from multiple threads, using an unbounded set of input shapes that are not known in advance; dynamic shape support is used for that purpose. A large (but of course not unlimited) amount of RAM is available, so a separate model is compiled for each thread. I'm well aware that dynamic shapes use more memory, however as the program runs the memory usage never seems to stop growing. I'm also aware of the CPU_RUNTIME_CACHE_CAPACITY parameter, but setting it to 0 doesn't seem to help here.

I can't share the model, however I was able to reproduce this with the public Mask R-CNN model.

On a 72-thread machine, if a fixed shape of [3,1024,1024] is used repeatedly, the memory usage peaks at around 76.3G. With random shapes it reaches 128G in 2-3 minutes, 192G in ~1h, and keeps growing at a rate of ~600M/minute RSS and ~7G/minute virtual. The number of mappings (/proc/[pid]/maps) also keeps growing, and many of the new regions have the x permission set, so it's probably some generated-code cache that keeps growing.
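
A minimal sketch of one way to count those executable mappings (not part of the reproducer; it just scans /proc/self/maps for regions whose permission field contains 'x'):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Count mappings in /proc/self/maps whose permission field contains 'x'
// (executable regions, e.g. JIT-generated code).
static size_t CountExecutableMappings() {
        std::ifstream maps("/proc/self/maps");
        std::string line;
        size_t count = 0;
        while (std::getline(maps, line)) {
                std::istringstream iss(line);
                std::string range, perms;
                if (iss >> range >> perms && perms.find('x') != std::string::npos)
                        count++;
        }
        return count;
}

int main() {
        std::cout << "executable mappings: " << CountExecutableMappings() << std::endl;
}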

Also happens in master branch as of 755651cd34c6c6195039fd17dbc3eac1619e2a88.

Step-by-step reproduction

Run the following program, preferably on a machine with a large number of CPU cores. Observe memory usage growing endlessly with top or another tool of your choice.

#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
#include "openvino/openvino.hpp"

static ov::Core core;
static std::shared_ptr<ov::Model> model;
static std::string device_name = "CPU";

static void StartThreadForRun() {
        // Each thread compiles its own copy of the model.
        auto compiled_model = core.compile_model(model, device_name);
        // Random dimension: a multiple of 32 in the range [32, 2016].
        auto xrand = []() {
                unsigned long align = 32;
                return (rand() % (2048-align) & ~(align-1)) + align;
        };
        // Infer forever, picking a new random input shape each iteration.
        for (;;) {
                ov::InferRequest request = compiled_model.create_infer_request();
                auto shape = ov::Shape{3, xrand(), xrand()};
                std::cout << std::this_thread::get_id() << " " << "shape: " << shape << std::endl;
                request.get_input_tensor().set_shape(shape);
                request.infer();
        }
}

int main() {
        core.set_property(device_name, ov::streams::num(0));
        core.set_property(device_name, ov::affinity(ov::Affinity::NONE));
        //core.set_property(device_name, {{"CPU_RUNTIME_CACHE_CAPACITY", "0"}});
        model = core.read_model("MaskRCNN-12/MaskRCNN-12.onnx");

        std::vector<std::thread> threads;
        for (unsigned int i = 0; i < std::thread::hardware_concurrency(); i++)
                threads.emplace_back(StartThreadForRun);
        for (auto &t : threads)
                t.join();
}

Relevant log output

No response


vurusovs commented 10 months ago

I confirm the memory leak; it's reproducible on CPU even without multi-threaded execution. Please find the code snippet below:

#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>
#include "openvino/openvino.hpp"
#include "openvino/pass/serialize.hpp"

// Read a numeric field (e.g. "VmRSS:", reported in kB) from /proc/self/status.
size_t getSystemDataByName(char* name) {
    auto parseLine = [](std::string line) -> size_t {
        std::string res = "";
        for (auto c : line)
            if (isdigit(c))
                res += c;
        if (res.empty())
            throw std::runtime_error("Can't get system memory values");
        return std::stoul(res);
    };

    FILE* file = fopen("/proc/self/status", "r");
    size_t result = 0;
    bool status = false;
    if (file != nullptr) {
        char line[128];

        while (fgets(line, 128, file) != NULL) {
            if (strncmp(line, name, strlen(name)) == 0) {
                result = parseLine(line);
                status = true;
                break;
            }
        }
        fclose(file);
    }
    if (!status)
        throw std::runtime_error("Can't get system memory values");
    return result;
}

size_t getVmRSSInKB() {
    return getSystemDataByName(const_cast<char*>("VmRSS:"));
}

ov::Core core;
std::shared_ptr<ov::Model> model;
std::string device_name = "CPU";
// Random dimension: a multiple of 32 in the range [32, 2016].
auto xrand = []() {
        unsigned long align = 32;
        return (rand() % (2048-align) & ~(align-1)) + align;
};

void StartThreadForRun() {
        auto compiled_model = core.compile_model(model, device_name);
        for (;;) {
                ov::InferRequest request = compiled_model.create_infer_request();
                request.infer();
                std::cout << std::this_thread::get_id() << "  " << getVmRSSInKB() << std::endl;
        }
}

int main() {
        // Round-trip the model through IR serialization, then fix a random static shape.
        model = core.read_model("./MaskRCNN-12.onnx");
        ov::serialize(model, "./MaskRCNN-12.xml", "./MaskRCNN-12.bin");
        model = core.read_model("./MaskRCNN-12.xml", "./MaskRCNN-12.bin");
        model->reshape(ov::Shape{3, xrand(), xrand()});

        StartThreadForRun();
}

It produces the following results (screenshot of the printed VmRSS values attached in the original comment):

vurusovs commented 10 months ago

Ticket ref. #123640

v-Golubev commented 3 months ago

@notaz hello, we have recently merged a memory leak fix into the master branch. Could you please try master starting from commit b886fa5? The fix will also be included in the OpenVINO 2024.2 release.

notaz commented 3 months ago

Singlethreaded use appears to be fixed.

However, multithreaded use (as in the program from my first post) still seems to grow memory indefinitely. What's worse, it now crashes: I got one crash after more than 3 hours of running the test and another after several minutes.

Thread 66 "test2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffed2ffd6c0 (LWP 2634349)]
0x00007fffec745492 in dnnl_post_ops::entry_t::is_binary() const () from /root/test/openvino_master_relwdbg/runtime/lib/intel64/libopenvino_intel_cpu_plugin.so
(gdb) bt
#0  0x00007fffec745492 in dnnl_post_ops::entry_t::is_binary() const () from /root/test/openvino_master_relwdbg/runtime/lib/intel64/libopenvino_intel_cpu_plugin.so
#1  0x00007fffed6656ca in dnnl::impl::cpu::binary_injector_utils::extract_bcast_strategies(std::vector<dnnl_post_ops::entry_t, std::allocator<dnnl_post_ops::entry_t> > const&, dnnl::impl::memory_desc_wrapper const&) () from /root/test/openvino_master_relwdbg/runtime/lib/intel64/libopenvino_intel_cpu_plugin.so
#2  0x00007fffecb87e68 in decltype (make_tuple(({parm#3},(false))...)) dnnl::impl::cpu::binary_injector_utils::bcast_strategies_present_tup<dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t>(std::vector<dnnl_post_ops::entry_t, std::allocator<dnnl_post_ops::entry_t> > const&, dnnl::impl::memory_desc_wrapper const&, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t, dnnl::impl::broadcasting_strategy_t) ()
   from /root/test/openvino_master_relwdbg/runtime/lib/intel64/libopenvino_intel_cpu_plugin.so
#3  0x00007fffecb78c76 in dnnl::impl::cpu::x64::jit_brgemm_amx_uker_base_t::jit_brgemm_amx_uker_base_t(dnnl::impl::cpu::x64::brgemm_t const&) ()
...
#68 0x00007ffff6e746ca in ov::InferRequest::infer() () from ../openvino_master_relwdbg/runtime/lib/intel64/libopenvino.so

Setting CPU_RUNTIME_CACHE_CAPACITY to 0 seems to make the crash reproduce faster.

v-Golubev commented 3 months ago

@notaz thank you for the quick feedback. We will take a look at the issue.

v-Golubev commented 2 months ago

@notaz We have found the root cause of the crash: it is connected with the primitive cache in a third-party library. While we continue the investigation and work on a fix, I can suggest a workaround: that primitive cache can be disabled via the DNNL_PRIMITIVE_CACHE_CAPACITY environment variable; if it is set to 0, the faulty cache is disabled. This may also reduce the memory consumption growth. One way to apply it is sketched below.
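
A minimal sketch of applying the workaround (an assumption on my side that setting the variable at the start of main, before any OpenVINO objects are created, is early enough; exporting DNNL_PRIMITIVE_CACHE_CAPACITY=0 in the shell before launching the process works just as well):

#include <cstdlib>
#include "openvino/openvino.hpp"

int main() {
        // Disable the oneDNN primitive cache before any OpenVINO/oneDNN code runs
        // (equivalently: export DNNL_PRIMITIVE_CACHE_CAPACITY=0 before launching).
        setenv("DNNL_PRIMITIVE_CACHE_CAPACITY", "0", 1);

        ov::Core core;
        auto model = core.read_model("MaskRCNN-12/MaskRCNN-12.onnx");
        auto compiled_model = core.compile_model(model, "CPU");
        // ... create infer requests and run as in the reproducer above ...
}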

Also, I have several questions regarding your reproducer.

  1. The attached Mask R-CNN model contains operations whose output shapes depend not only on the input shape but on the input values as well. This means that the range of input shapes is not the only parameter that impacts maximum memory consumption. Do you know whether your original model (which can't be shared) contains such ops (e.g. NonZero) as well?
  2. Could you please explain why you create a CompiledModel for each thread? Is it done to maximize throughput or for something else? Is it possible to create a single CompiledModel for all threads and set the number of streams to the thread concurrency value (see the sketch below)? That approach is preferable in a throughput-maximization scenario. If my guess about the scenario is wrong, please let me know.
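
A minimal sketch of what I mean (a sketch only; it reuses the ov::streams::num property from your reproducer, and the model path, shape, and stream count are placeholders to tune for your setup):

#include <thread>
#include <vector>
#include "openvino/openvino.hpp"

int main() {
        ov::Core core;
        auto model = core.read_model("MaskRCNN-12/MaskRCNN-12.onnx");

        // One CompiledModel shared by all threads; the plugin creates one
        // execution stream per hardware thread instead of one model per thread.
        auto n = std::thread::hardware_concurrency();
        auto compiled_model = core.compile_model(model, "CPU",
                ov::streams::num(static_cast<int32_t>(n)));

        std::vector<std::thread> threads;
        for (unsigned int i = 0; i < n; i++) {
                threads.emplace_back([&compiled_model]() {
                        // Each thread still owns its own InferRequest.
                        ov::InferRequest request = compiled_model.create_infer_request();
                        request.get_input_tensor().set_shape(ov::Shape{3, 1024, 1024});
                        request.infer();
                });
        }
        for (auto &t : threads)
                t.join();
}
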
notaz commented 2 months ago

Thanks. With that, I no longer encounter the crash, but I haven't noticed an improvement in memory consumption.

As for the answers:

  1. I was told it does not.
  2. It is for throughput. Unlike in a synthetic benchmark, inference is not the only work that needs to be done. The work is split across hundreds of threads running on hundreds of CPU cores on a large server. Once a thread has prepared the data to infer, it has to pass it to the inference engine, wait for the result, and then do further processing. Doing this through a single model creates a lock-contention bottleneck, and it also causes useless context switching between our thread pool and the inference engine's. It's simply more efficient to do it on a dedicated thread with its own copy of everything and without locks; any kind of locking or blocking bottlenecks the system more and more as the core count goes up.