microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Possible memory leak #6058

Closed thhart closed 3 years ago

thhart commented 3 years ago

I am facing a memory leak problem when using the ONNX Runtime Java API. The JVM's virtual memory grows slowly over several hundred analyses with an object detection model until it becomes extremely high and memory is exhausted. The standard Java heap stays within its bounds (16 GB) during this time, so the leak is happening in the native section. Unfortunately I do not know how to analyze Java's off-heap memory.

I use a standard ONNX Runtime session without any special parameters.

I already tried to compile without OpenMP support, but the engine is nearly unusable then (10x slower), so I could not reproduce the issue satisfactorily that way.

It is a 64-bit Linux system. ONNX Runtime is from the standard Maven distribution; both 1.5.2 and 1.4.0 were tested.

Using a new session does not help.

Can you give a hint on how to track the off-heap memory usage?

Craigacp commented 3 years ago

valgrind. That's what I was using to check for leaks when developing it.

Are you closing all the input and output OnnxValues after you've finished with them?

thhart commented 3 years ago

The code looks like the snippet below. list is a float array containing an image. As you can see, the closing of all resources is commented out since these are auto-closed; if I try to close them manually an exception is thrown. I double-checked all resources which might leak but could not find anything.

  
      final ArrayList<Recognition> boxes = new ArrayList<>();
      final FloatBuffer floatBuffer = FloatBuffer.wrap(list);
      final HashMap<String, OnnxTensor> inputs = new HashMap<>();
      OnnxTensor tensor =
          OnnxTensor.createTensor(environment, floatBuffer, new long[] {1, 3, dimX, dimY});
      //noinspection EmptyFinallyBlock
      try {
        //noinspection resource
        inputs.put("images", tensor);
        final Optional<OnnxValue> output;
        @SuppressWarnings("resource")
        Result result = session.run(inputs, (RunOptions) null);
        //noinspection EmptyFinallyBlock
        try {
          tensor.close();
          output = result.get("output");
          // Color[] colors = HelperColor.randomColor(2);
          if (output.isPresent()) {
            final float[][][] v0;
            try (OnnxValue value = output.get()) {
              v0 = (float[][][]) value.getValue();
              final ArrayList<Recognition> detections = HelperRecognition.nms(extract(v0[0]), labels, mNmsThresh);
              for (Recognition d : detections) {
                d.scaleUp(sX, sY);
              }
              boxes.addAll(detections);
            }
          } else {
            throw new RuntimeException("no result to read");
          }
        } finally {
          // System.err.println("Not free result...");
          // result.close();
        }
        inputs.clear();
      } finally {
        // System.err.println("Not free tensor...");
        // tensor.close();
      }
      return boxes;

thhart commented 3 years ago

As soon as I try to free the result, the JVM crashes with the following error: free(): double free detected in tcache 2

Craigacp commented 3 years ago

The model only has a single output? It's preferable to use try-with-resources on the Result object itself; that will ensure that all the values returned by the model are closed.

I thought that OnnxTensor had a guard on its close method to prevent the double free. Guess not; I'll put one in at some point in the next few weeks.

Due to the way you've scoped the input tensor (i.e. it's outside of the try block), it (and its backing direct ByteBuffer) will stick around after the close call has been made. In your real program is this object likely to escape?
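
For illustration, a minimal sketch of that pattern; it assumes the same environment, session, list, dimX, dimY and the "images"/"output" names from the snippets in this thread, and it is not the code from this issue. Closing the Result releases every OnnxValue it owns, so the output should not be closed separately or used after the block.

    // Minimal sketch, not the code from this issue.
    // imports: ai.onnxruntime.*; java.nio.FloatBuffer; java.util.HashMap; java.util.Map;
    final FloatBuffer floatBuffer = FloatBuffer.wrap(list);
    final Map<String, OnnxTensor> inputs = new HashMap<>();
    try (OnnxTensor tensor = OnnxTensor.createTensor(
            environment, floatBuffer, new long[]{1, 3, dimX, dimY})) {
      inputs.put("images", tensor);
      // try-with-resources on the Result closes every OnnxValue it owns,
      // so the individual outputs must not be closed again afterwards.
      try (OrtSession.Result result = session.run(inputs)) {
        OnnxValue output = result.get("output")
            .orElseThrow(() -> new RuntimeException("no result to read"));
        float[][][] v0 = (float[][][]) output.getValue();
        // ... post-process v0 while the Result is still open ...
      }
    }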

thhart commented 3 years ago

The model is a YOLO model which has one complex structured output.

The input tensor is in a try-with-resources block now, but the problem is still the same.

The result cannot be in a try-with-resources block due to the double-free problem; as soon as I do this it crashes.

Below is the updated code, but memory is still increasing steadily:

      final ArrayList<Recognition> boxes = new ArrayList<>();
      final FloatBuffer floatBuffer = FloatBuffer.wrap(list);
      final HashMap<String, OnnxTensor> inputs = new HashMap<>();
      try (OnnxTensor tensor = OnnxTensor.createTensor(environment, floatBuffer, new long[]{1, 3, dimX, dimY})) {
        inputs.put("images", tensor);
        final Optional<OnnxValue> output;
        Result result = session.run(inputs, options);
        try {
          output = result.get("output");
          // Color[] colors = HelperColor.randomColor(2);
          if (output.isPresent()) {
            final float[][][] v0;
            try (OnnxValue value = output.get()) {
              v0 = (float[][][]) value.getValue();
              final ArrayList<Recognition> detections = HelperRecognition.nms(extract(v0[0]), labels, mNmsThresh);
              for (Recognition d : detections) {
                d.scaleUp(sX, sY);
              }
              boxes.addAll(detections);
            }
          } else {
            throw new RuntimeException("no result to read");
          }
        } finally {
          inputs.clear();
          // System.err.println("Not free result...");
          // result.close();
        }
      }

Craigacp commented 3 years ago

If you recompile ONNX Runtime with debug symbols then valgrind should be able to show you where the lost memory is being allocated. Then we can try and run down what's going on. I wouldn't run a big thing through it though as valgrind is slow (and CPU only).

thhart commented 3 years ago

Thanks for the hint; I will check further and come back if I have more information. Valgrind is too slow to check with. Since the library is woven into a complex environment I cannot be sure yet; I will modularize it into a separate JVM first.

Craigacp commented 3 years ago

If possible, check the model using one of the other interfaces (e.g. Python) and see if you observe similar behaviour. If not, then at least it's narrowed down to where the Java interface allocates memory in native code.

cgpeter96 commented 2 years ago

Recently I observed a problem with onnxruntime in Java. The memory of my Java process using ONNX Runtime has been increasing slowly and never stops. I have tried my best to figure out the memory growth, but I could only trace it to something happening inside libonnxruntime.so.

Have you found a way to deal with the problem other than rebooting? @thhart

Craigacp commented 2 years ago

Could you open another issue with more details (e.g. platform, session options/providers, ORT version, JVM version)?

thhart commented 2 years ago

@cgpeter96 The problem is within native code and it is not possible to diagnose from within Java. I have modularized my code in such a way that the analyzers using ONNX Runtime restart themselves after a specific time.

Craigacp commented 2 years ago

So there's a memory leak inside the ORT native runtime? That seems like something that should get fixed. Is it still occurring in the latest version?

cgpeter96 commented 2 years ago

> So there's a memory leak inside the ORT native runtime? That seems like something that should get fixed. Is it still occurring in the latest version?

No, I fixed the problem. I had forgotten to close the OnnxTensor. Just wrap the tensor creation and the session run in try blocks so the resources get closed.

shuiyuejihua commented 2 years ago

--------------- S Y S T E M ---------------

OS: Windows 10 , 64 bit Build 19041 (10.0.19041.292) OS uptime: 6 days 4:30 hours

CPU: total 8 (initial active 8) (4 cores per cpu, 2 threads per core) family 6 model 142 stepping 12 microcode 0xde, cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, lzcnt, tsc, tscinvbit, avx, avx2, aes, erms, clmul, bmi1, bmi2, adx, fma, vzeroupper, clflush, clflushopt

Memory: 4k page, system-wide physical 7785M (415M free) TotalPageFile size 20585M (AvailPageFile size 5M) current process WorkingSet (physical memory assigned to process): 3566M, peak: 3796M current process commit charge ("private bytes"): 13863M, peak: 13864M

vm_info: Java HotSpot(TM) 64-Bit Server VM (17.0.2+8-LTS-86) for windows-amd64 JRE (17.0.2+8-LTS-86), built on Dec 7 2021 21:51:03 by "mach5one" with MS VC++ 16.8 / 16.9 (VS2019)

I tried wrapping tensor creation and the session run in try blocks as suggested and got the results, but it still runs out of memory.

I run a for loop about 2000 times.

Starting the ONNX environment and session only once and closing the ONNX results manually takes longer to hit this error than creating them once per loop with try blocks, but after about 1980 loops it throws the error again. Maybe:

> So there's a memory leak inside the ORT native runtime? That seems like something that should get fixed. Is it still occurring in the latest version?

Craigacp commented 2 years ago

Can you provide details about your model (inputs, outputs etc), and show the code you're using to drive the inference (including the session creation and any session options)? I assume this is using ORT 1.10?

shuiyuejihua commented 2 years ago

Sorry, I found that the problem was with DJL's NDManager: the manager was not closed, which caused the memory leak. After I used a sub-manager to close the NDArrays, it is fine.
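
For anyone who lands here with the same DJL symptom, a minimal sketch of that sub-manager pattern, assuming DJL's ai.djl.ndarray API; the data values and loop count are placeholders.

    // Minimal sketch, assuming DJL's ai.djl.ndarray API; values are placeholders.
    // imports: ai.djl.ndarray.NDArray; ai.djl.ndarray.NDManager;
    try (NDManager manager = NDManager.newBaseManager()) {
      for (int i = 0; i < 2000; i++) {
        // Each iteration gets its own sub-manager; closing it releases the
        // native memory of every NDArray attached to it.
        try (NDManager sub = manager.newSubManager()) {
          NDArray input = sub.create(new float[]{0.1f, 0.2f, 0.3f});
          // ... run inference with `input` here ...
        }
      }
    }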