thhart closed this issue 3 years ago
`valgrind`. That's what I was using to check for leaks when developing it.

Are you closing all the input and output `OnnxValue`s after you've finished with them?
The code looks like the block below. `list` is a float array containing an image. As you can see, the close of all resources is commented out since these are auto-closed; if I try to close them manually an exception is thrown. I double-checked all resources that might leak but could not find anything.
```java
final ArrayList<Recognition> boxes = new ArrayList<>();
final FloatBuffer floatBuffer = FloatBuffer.wrap(list);
final HashMap<String, OnnxTensor> inputs = new HashMap<>();
OnnxTensor tensor = OnnxTensor.createTensor(environment, floatBuffer, new long[] {1, 3, dimX, dimY});
//noinspection EmptyFinallyBlock
try {
    //noinspection resource
    inputs.put("images", tensor);
    final Optional<OnnxValue> output;
    @SuppressWarnings("resource")
    Result result = session.run(inputs, (RunOptions) null);
    //noinspection EmptyFinallyBlock
    try {
        tensor.close();
        output = result.get("output");
        // Color[] colors = HelperColor.randomColor(2);
        if (output.isPresent()) {
            final float[][][] v0;
            try (OnnxValue value = output.get()) {
                v0 = (float[][][]) value.getValue();
                final ArrayList<Recognition> detections = HelperRecognition.nms(extract(v0[0]), labels, mNmsThresh);
                for (Recognition d : detections) {
                    d.scaleUp(sX, sY);
                }
                boxes.addAll(detections);
            }
        } else {
            throw new RuntimeException("no result to read");
        }
    } finally {
        // System.err.println("Not free result...");
        // result.close();
    }
    inputs.clear();
} finally {
    // System.err.println("Not free tensor...");
    // tensor.close();
}
return boxes;
```
As soon as I try to free the result, the JVM crashes with the following error: `free(): double free detected in tcache 2`
The model only has a single output? It's preferable to use try-with-resources on the result object itself; that ensures all the values returned by the model are closed.
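That pattern could be sketched roughly like this (against the `ai.onnxruntime` API; `environment`, `session`, `list`, the tensor shape, and the `images`/`output` names are taken from the code above, and the surrounding class is hypothetical):

```java
import ai.onnxruntime.*;
import java.nio.FloatBuffer;
import java.util.Map;
import java.util.Optional;

class InferenceSketch {
    // The Result owns every output OnnxValue, so closing the Result via
    // try-with-resources frees them all; the values must not be closed again.
    static float[][][] runOnce(OrtEnvironment environment, OrtSession session,
                               float[] list, long dimX, long dimY) throws OrtException {
        try (OnnxTensor tensor = OnnxTensor.createTensor(
                     environment, FloatBuffer.wrap(list), new long[]{1, 3, dimX, dimY});
             OrtSession.Result result = session.run(Map.of("images", tensor))) {
            Optional<OnnxValue> output = result.get("output");
            if (output.isEmpty()) {
                throw new RuntimeException("no result to read");
            }
            // Copy the data out while the Result is still open; the backing
            // native memory is released when the try block exits.
            return (float[][][]) output.get().getValue();
        }
    }
}
```

Since the `Result` closes its own values, the inner `try (OnnxValue value = ...)` from the original code is the likely source of the double free.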
I thought that the OnnxTensor had a guard on its close method to prevent the double free. Guess not, I'll put one in at some point in the next few weeks.
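A guard like that could look roughly like this (an illustrative sketch of an idempotent `close()`, not ONNX Runtime's actual implementation):

```java
// Minimal sketch of a close() guard: an AutoCloseable whose native
// handle is released at most once, no matter how often close() is called.
final class GuardedHandle implements AutoCloseable {
    private boolean closed = false;
    int releaseCount = 0;           // visible for testing

    private void releaseNative() {  // stand-in for the JNI free call
        releaseCount++;
    }

    @Override
    public synchronized void close() {
        if (!closed) {              // a second close() becomes a no-op
            closed = true;
            releaseNative();
        }
    }
}
```

With this guard in place, a double close from user code no longer reaches the native free a second time.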
Due to the way you've scoped the input tensor (i.e. it's outside the try block), it (and its backing direct ByteBuffer) will stick around after the close call has been made. In your real program, is this object likely to escape?
The model is a YOLO model with one complex, structured output.

The input tensor is in a try-with-resources block now, but the problem is the same. The result cannot go in the try-with-resources due to the double-free problem; as soon as I do that, it crashes.

Below is the updated code, but memory is still increasing steadily:
```java
final ArrayList<Recognition> boxes = new ArrayList<>();
final FloatBuffer floatBuffer = FloatBuffer.wrap(list);
final HashMap<String, OnnxTensor> inputs = new HashMap<>();
try (OnnxTensor tensor = OnnxTensor.createTensor(environment, floatBuffer, new long[]{1, 3, dimX, dimY})) {
    inputs.put("images", tensor);
    final Optional<OnnxValue> output;
    Result result = session.run(inputs, options);
    try {
        output = result.get("output");
        // Color[] colors = HelperColor.randomColor(2);
        if (output.isPresent()) {
            final float[][][] v0;
            try (OnnxValue value = output.get()) {
                v0 = (float[][][]) value.getValue();
                final ArrayList<Recognition> detections = HelperRecognition.nms(extract(v0[0]), labels, mNmsThresh);
                for (Recognition d : detections) {
                    d.scaleUp(sX, sY);
                }
                boxes.addAll(detections);
            }
        } else {
            throw new RuntimeException("no result to read");
        }
    } finally {
        inputs.clear();
        // System.err.println("Not free result...");
        // result.close();
    }
}
```
If you recompile ONNX Runtime with debug symbols then valgrind should be able to show you where the lost memory is being allocated. Then we can try and run down what's going on. I wouldn't run a big thing through it though as valgrind is slow (and CPU only).
Thanks for the hint, I will check further and come back if I have more information. Valgrind is too slow for this check. Since the library is woven into a complex environment I cannot be sure yet; I will modularize it into a separate JVM first.
If possible, check the model using one of the other interfaces (e.g. Python) and see if you observe similar behaviour. If not, then at least it's narrowed down to where the Java interface allocates memory in native code.
Recently, I have observed a problem with onnxruntime in Java. The memory of my Java code using ONNX increases slowly but never stops. I have tried my best to figure out the memory growth, but I have only found that the problem happens inside `libonnxruntime.so`.

Have you figured out the problem, other than rebooting? @thhart
Could you open another issue with more details (e.g. platform, session options/providers, ORT version, JVM version)?
@cgpeter96 The problem is within native code and is not possible to diagnose from Java. I have modularized my code so that the analyzers using ONNX restart themselves after a specific time.
So there's a memory leak inside the ORT native runtime? That seems like something that should get fixed. Is it still occurring in the latest version?
> So there's a memory leak inside the ORT native runtime? That seems like something that should get fixed. Is it still occurring in the latest version?
No, I fixed the problem. I had forgotten to close the `OnnxTensor`. Just wrap the tensor creation and the session run in try-with-resources.
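The per-iteration pattern described here would look roughly like this (a sketch against the `ai.onnxruntime` API; the model path, input name, tensor shape, and `nextInput` supplier are placeholders):

```java
import ai.onnxruntime.*;
import java.nio.FloatBuffer;
import java.util.Map;
import java.util.function.Supplier;

class LoopSketch {
    // Environment and session are created once; tensor and result are scoped
    // to each iteration, so their native memory is freed on every pass.
    static void runLoop(Supplier<float[]> nextInput) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession("model.onnx",        // placeholder path
                                                    new OrtSession.SessionOptions())) {
            for (int i = 0; i < 2000; i++) {
                try (OnnxTensor tensor = OnnxTensor.createTensor(
                             env, FloatBuffer.wrap(nextInput.get()),
                             new long[]{1, 3, 640, 640});                // placeholder shape
                     OrtSession.Result result = session.run(Map.of("images", tensor))) {
                    // read outputs here; both tensor and result close
                    // automatically at the end of each iteration
                }
            }
        }
    }
}
```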
```
--------------- S Y S T E M ---------------

OS: Windows 10, 64 bit Build 19041 (10.0.19041.292)
OS uptime: 6 days 4:30 hours

CPU: total 8 (initial active 8) (4 cores per cpu, 2 threads per core) family 6 model 142
  stepping 12 microcode 0xde, cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3,
  sse4.1, sse4.2, popcnt, lzcnt, tsc, tscinvbit, avx, avx2, aes, erms, clmul, bmi1, bmi2,
  adx, fma, vzeroupper, clflush, clflushopt

Memory: 4k page, system-wide physical 7785M (415M free)
TotalPageFile size 20585M (AvailPageFile size 5M)
current process WorkingSet (physical memory assigned to process): 3566M, peak: 3796M
current process commit charge ("private bytes"): 13863M, peak: 13864M

vm_info: Java HotSpot(TM) 64-Bit Server VM (17.0.2+8-LTS-86) for windows-amd64 JRE
  (17.0.2+8-LTS-86), built on Dec 7 2021 21:51:03 by "mach5one" with MS VC++ 16.8 / 16.9 (VS2019)
```
I tried try-with-resources on the tensor creation and the session run and got results, but still ran out of memory. I ran a loop about 2000 times. Starting the ONNX environment and session only once and closing the ONNX results manually takes longer to hit the error than opening and closing everything once per iteration, but after about 1980 iterations it threw the error again. So, maybe:
> So there's a memory leak inside the ORT native runtime? That seems like something that should get fixed. Is it still occurring in the latest version?
Can you provide details about your model (inputs, outputs etc), and show the code you're using to drive the inference (including the session creation and any session options)? I assume this is using ORT 1.10?
Sorry, I found that the problem was with the DJL NDManager: the manager was not closed, hence the memory leak. After I used a sub-manager to close the NDArray, it is fine.
I am facing a memory leak problem using ONNX Runtime. The Java virtual memory keeps slowly growing until it gets extremely high after several hundred analyses with an object detection model, until the memory is exhausted. The standard Java heap stays within its bounds (16 GB) during this time, so it is happening in the native section. Unfortunately I don't know how to analyze the off-heap memory of Java.

I use a standard ONNX session without any special parameters.

I already tried to compile without OpenMP support, however the engine is nearly unusable then (10x slower), so I could not reproduce with satisfaction.

It is a 64-bit Linux system, with ONNX Runtime from the standard Maven distribution; 1.5.2 as well as 1.4.0 tested.

Using a new session does not help.

Can you give a hint how to track the off-heap memory occupation?