tensorflow / java

Java bindings for TensorFlow
Apache License 2.0

The Java Tensorflow GPU library has a memory leak. #343

Open · x5w46fxdx opened this issue 3 years ago

x5w46fxdx commented 3 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_memory_leak

System information

Describe the current behavior
CPU version: no memory leak (tensorflow-core-platform-cpu:0.3.1).
GPU version: memory leak occurred (tensorflow-core-platform-gpu:0.3.1).

No changes have been made to the code between the two runs. In each round of execution, the GPU library did not release memory correctly, and memory usage kept increasing until an exception occurred.

Describe the expected behavior

Code to reproduce the issue
The most basic graph operations.
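(The exact code was not posted. A minimal, hypothetical sketch of "the most basic graph operations" run in a loop, written against the 0.3.1 API, might look like the following; all names here are illustrative.)

```java
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.math.Add;
import org.tensorflow.types.TFloat32;

public class LeakRepro {
  public static void main(String[] args) {
    try (Graph g = new Graph()) {
      Ops tf = Ops.create(g);
      Constant<TFloat32> a = tf.constant(2.0f);
      Constant<TFloat32> b = tf.constant(3.0f);
      Add<TFloat32> sum = tf.math.add(a, b);
      for (int round = 0; round < 30; round++) {
        // A fresh session per round; every fetched tensor is closed, yet
        // the report says process memory still grows on the GPU build.
        try (Session s = new Session(g);
             TFloat32 result = (TFloat32) s.runner().fetch(sum).run().get(0)) {
          System.out.println("Round " + round + " -> " + result.getFloat());
        }
      }
    }
  }
}
```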

Other info / logs

2021-06-21 03:37:32.085884: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-21 03:37:32.252155: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-21 03:37:32.288428: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-21 03:37:32.289228: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:03:00.0 name: GeForce RTX 3070 Ti computeCapability: 8.6 coreClock: 1.77GHz coreCount: 48 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 566.30GiB/s
2021-06-21 03:37:32.289318: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-21 03:37:32.292972: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-21 03:37:32.293014: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-21 03:37:32.294163: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-21 03:37:32.294452: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-21 03:37:32.297755: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-06-21 03:37:32.298788: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-06-21 03:37:32.298964: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-21 03:37:32.299075: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-21 03:37:32.299705: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-21 03:37:32.300236: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-06-21 03:37:32.300275: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-21 03:37:32.913810: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-21 03:37:32.913854: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-06-21 03:37:32.913866: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-06-21 03:37:32.914029: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-21 03:37:32.914613: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-21 03:37:32.915568: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-21 03:37:32.916086: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6731 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070 Ti, pci bus id: 0000:03:00.0, compute capability: 8.6)
2021-06-21 03:37:33.131901: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2799830000 Hz
2021-06-21 03:39:47.360919: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-21 03:39:48.122008: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-21 03:39:48.124108: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Round 0 learn:7206..7566 test:7926..8286 millis:2078
Round 1 learn:7207..7567 test:7927..8287 millis:1219
....
Round 28 learn:7234..7594 test:7954..8314 millis:840
Round 29 learn:7235..7595 test:7955..8315 millis:805

Exception in thread "main" java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (37672M) > maxPhysicalBytes (32176M)
    at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:695)
    at org.tensorflow.internal.c_api.AbstractTF_Tensor.withDeallocator(AbstractTF_Tensor.java:98)
    at org.tensorflow.Session.run(Session.java:694)
    at org.tensorflow.Session.access$100(Session.java:72)
    at org.tensorflow.Session$Runner.runHelper(Session.java:381)
    at org.tensorflow.Session$Runner.run(Session.java:329)

Craigacp commented 3 years ago

What JVM version and GC algorithm are you using?

x5w46fxdx commented 3 years ago

> What JVM version and GC algorithm are you using?

JVM version 10.0.2. The default GC configuration has not been modified. The CPU version runs normally, but the GPU version has a memory leak... This happens whether the code is compiled to bytecode version 8 or 10.

Craigacp commented 3 years ago

Are you enabling any GPU specific configuration options at runtime (e.g. CUDA unified memory)? Is it possible to get a heap dump just before it runs out of memory?

The GPU library is identical to the CPU library from the perspective of the Java code; the only difference is that the TensorFlow native library is compiled differently. That OOM comes out of JavaCPP, which we use to provide native interop; it monitors off-heap memory usage by looking at the process's RSS. This is a lossy accounting which can be confused by different forms of memory mapping, like ZGC's triple-mapped heap, and potentially by things like CUDA unified memory, which adds additional pages to the process's address space. @saudet, have you seen any interactions between CUDA & RSS in DL4J?
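One way to see which side of that accounting is growing is to log JavaCPP's public `Pointer` counters between rounds (a sketch; `MemoryProbe` is a hypothetical helper name):

```java
import org.bytedeco.javacpp.Pointer;

public final class MemoryProbe {
  // Compare JavaCPP's two views of native memory: totalBytes() counts
  // allocations JavaCPP tracks through deallocators, while physicalBytes()
  // is the process RSS that the OOM check above is thrown against.
  public static void log(int round) {
    System.out.printf("round=%d tracked=%dM rss=%dM limit=%dM%n",
        round,
        Pointer.totalBytes() >> 20,
        Pointer.physicalBytes() >> 20,
        Pointer.maxPhysicalBytes() >> 20);
  }
}
```

If `rss` grows while `tracked` stays flat, the growth is in allocations JavaCPP never sees, e.g. inside the native TensorFlow runtime.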

x5w46fxdx commented 3 years ago

The default GPU configuration has not been modified. The previous post contains the complete log. At runtime the system memory usage keeps increasing until the failure occurs, while the Java heap behaves normally. The memory is not released even after the session is closed.

x5w46fxdx commented 3 years ago

--------------Normal situation--------------------
MemTotal:       65893512 kB
MemFree:        45787260 kB
MemAvailable:   51365956 kB
Buffers:           86248 kB
Cached:          6043840 kB
SwapCached:            0 kB
Active:         14608748 kB
Inactive:        3874096 kB
Active(anon):   12306900 kB
Inactive(anon):   186920 kB
Active(file):    2301848 kB
Inactive(file):  3687176 kB
Unevictable:         144 kB
Mlocked:             144 kB
SwapTotal:       3999740 kB
SwapFree:        3999740 kB
Dirty:               420 kB
Writeback:             0 kB
AnonPages:      12353112 kB
Mapped:          1564308 kB
Shmem:            191996 kB
Slab:             759320 kB
SReclaimable:     315752 kB
SUnreclaim:       443568 kB
KernelStack:       40464 kB
PageTables:       117828 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    36946496 kB
Committed_AS:   37782880 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     1307792 kB
DirectMap2M:    37414912 kB
DirectMap1G:    30408704 kB

--------------Abnormal situation---------------
MemTotal:       65893512 kB
MemFree:        24082252 kB
MemAvailable:   29641232 kB
Buffers:           86248 kB
Cached:         22687304 kB
SwapCached:            0 kB
Active:         19231648 kB
Inactive:       20536920 kB
Active(anon):   16949196 kB
Inactive(anon): 16881028 kB
Active(file):    2282452 kB
Inactive(file):  3655892 kB
Unevictable:         144 kB
Mlocked:             144 kB
SwapTotal:       3999740 kB
SwapFree:        3999740 kB
Dirty:               420 kB
Writeback:             0 kB
AnonPages:      16995420 kB
Mapped:         18627676 kB
Shmem:          16886128 kB
Slab:             954420 kB
SReclaimable:     346740 kB
SUnreclaim:       607680 kB
KernelStack:       40592 kB
PageTables:       155296 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    36946496 kB
Committed_AS:   57382128 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     1285264 kB
DirectMap2M:    37437440 kB
DirectMap1G:    30408704 kB

Java process info (/proc/pid/smaps) download URL: https://cowtransfer.com/s/5e3ad56b120943 (password: 743632, valid for 24 hours).

saudet commented 3 years ago

Could you try again with 0.4.0-SNAPSHOT to make sure this hasn't been fixed?

> The GPU library is identical to the CPU library from the perspective of the Java code; the only difference is that the TensorFlow native library is compiled differently. That OOM comes out of JavaCPP, which we use to provide native interop; it monitors off-heap memory usage by looking at the process's RSS. This is a lossy accounting which can be confused by different forms of memory mapping, like ZGC's triple-mapped heap, and potentially by things like CUDA unified memory, which adds additional pages to the process's address space. @saudet, have you seen any interactions between CUDA & RSS in DL4J?

No, there's nothing in CUDA that I know of that affects RSS. We can always set the "org.bytedeco.javacpp.maxbytes" system property to 0 to disable this behavior though: http://bytedeco.org/javacpp/apidocs/org/bytedeco/javacpp/Pointer.html#maxBytes
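A minimal sketch of that workaround, assuming the standard JavaCPP property names. Note that the OOM in this thread is thrown against maxPhysicalBytes, so its companion maxphysicalbytes property presumably needs the same treatment, and this only disables the check rather than fixing any underlying leak:

```java
public class DisableJavacppLimits {
  public static void main(String[] args) {
    // Must be set before any JavaCPP-backed class loads; equivalently, pass
    // -Dorg.bytedeco.javacpp.maxbytes=0 -Dorg.bytedeco.javacpp.maxphysicalbytes=0
    // on the java command line.
    System.setProperty("org.bytedeco.javacpp.maxbytes", "0");
    System.setProperty("org.bytedeco.javacpp.maxphysicalbytes", "0");
    // ... run the TensorFlow workload as usual ...
  }
}
```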

x5w46fxdx commented 3 years ago

With 0.4.0-SNAPSHOT, no GPU device is found, but 0.3.1 does not have this problem.

Other info / logs
2021-06-22 13:15:20.941703: I external/org_tensorflow/tensorflow/core/common_runtime/direct_session.cc:361] Device mapping: no known devices.

--------deviceQuery----------
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 3070 Ti" CUDA Driver Version / Runtime Version 11.2 / 11.2 CUDA Capability Major/Minor version number: 8.6 Total amount of global memory: 7981 MBytes (8368685056 bytes) (48) Multiprocessors, (128) CUDA Cores/MP: 6144 CUDA Cores GPU Max Clock rate: 1770 MHz (1.77 GHz) Memory Clock rate: 9501 Mhz Memory Bus Width: 256-bit L2 Cache Size: 4194304 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 1536 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: Yes Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 1, Device0 = GeForce RTX 3070 Ti
Result = PASS

--------cat /proc/driver/nvidia/version--------
NVRM version: NVIDIA UNIX x86_64 Kernel Module 460.84 Wed May 26 20:14:59 UTC 2021
GCC version: gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)

------------nvidia-smi--------
Tue Jun 22 13:19:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 307...  Off  | 00000000:03:00.0  On |                  N/A |
|  0%   53C    P8    17W / 290W |    540MiB /  7981MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1985      G   /usr/lib/xorg/Xorg                358MiB |
|    0   N/A  N/A      5913      G   /usr/bin/gnome-shell               61MiB |
|    0   N/A  N/A     20461      G   ...ion-2019.3.4/jbr/bin/java        3MiB |
|    0   N/A  N/A     25485      G   ...token=8600309387955201148      112MiB |
+-----------------------------------------------------------------------------+

------------------------------nvcc -V---------------------------------------
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:32:09_PST_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0

rnett commented 3 years ago

Check your CUDA and cuDNN versions: 0.4.0 is on TensorFlow 2.5.0, while 0.3.1 was on 2.4.1, and I think the required CUDA version changed.

x5w46fxdx commented 3 years ago

With 0.4.0-SNAPSHOT / TF 2.5.0, none of the following combinations can find the GPU:
cuda_10.2.89 / libcudnn8_8.2.0.53-1+cuda10.2 / libcudnn8-dev_8.2.0.53-1+cuda10.2
cuda_11.2.1 / libcudnn8_8.1.0.77-1+cuda11.2 / libcudnn8-dev_8.1.0.77-1+cuda11.2
cuda_11.0.3 / libcudnn8_8.0.4.30-1+cuda11.0 / libcudnn8-dev_8.0.4.30-1+cuda11.0
cuda_11.3.1 / libcudnn8_8.2.0.53-1+cuda11.3 / libcudnn8-dev_8.2.0.53-1+cuda11.3 (newly added to the test)

With 0.3.1 / TF 2.4.1, the GPU can be found with:
cuda_11.2.1 / libcudnn8_8.1.0.77-1+cuda11.2 / libcudnn8-dev_8.1.0.77-1+cuda11.2
cuda_11.0.3 / libcudnn8_8.0.4.30-1+cuda11.0 / libcudnn8-dev_8.0.4.30-1+cuda11.0

saudet commented 3 years ago

Please set the "org.bytedeco.javacpp.logger.debug" system property to "true" to make sure it's loading the version of CUDA you think it's loading. TF will not be able to use any libraries loaded from CUDA != 11.0.x and cuDNN != 8.0.x. To make sure, you should completely remove any other versions from your system. As an alternative, you could also add the corresponding cuda-platform-redist artifact to your dependencies: https://github.com/bytedeco/javacpp-presets/tree/1.5.4/cuda
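For instance (a sketch; the property must be set before any JavaCPP-backed TensorFlow class is touched, and the Maven coordinate in the comment is what I'd expect for CUDA 11.0 + cuDNN 8.0 against JavaCPP 1.5.4, so treat it as an assumption):

```java
public class DebugNativeLoading {
  public static void main(String[] args) {
    // With this set, JavaCPP logs every native library it resolves and from
    // where, which reveals whether the expected CUDA version is being loaded.
    System.setProperty("org.bytedeco.javacpp.logger.debug", "true");

    // Alternatively, pin the bundled CUDA libraries via Maven, e.g.
    // org.bytedeco:cuda-platform-redist:11.0-8.0-1.5.4 (presumed coordinate).

    System.out.println("TensorFlow version: " + org.tensorflow.TensorFlow.version());
  }
}
```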

x5w46fxdx commented 3 years ago

If the problem cannot be solved, we can only abandon the Java version and use the Python version...

I have changed the Nvidia driver version, the Java version, and the CUDA and cuDNN versions; despite trying hard, I still can't solve the problem.

x5w46fxdx commented 3 years ago

Thank you for helping me. @saudet @rnett @Craigacp

Craigacp commented 3 years ago

Can you successfully use the GPU from TF 2.5.0 in Python? That will help narrow down whether it's a CUDA versioning issue or something in the way we package TF Java.

I'll take a look at the process mapping information later in the week to see if I can spot where the allocations are.

x5w46fxdx commented 3 years ago

Python GPU 2.4.1: no memory leaks. Java GPU 0.3.1/2.4.1: memory leak. Please fix the Java GPU 0.3.1/2.4.1 memory leak bug.

I will test Python GPU 2.5.0 tomorrow.

Craigacp commented 3 years ago

We're trying to run it down, and the different tests will help us. The current master branch (0.4.0-SNAPSHOT) has fixes for a few different memory leaks, and the upgrade to TF 2.5.0 touched a bunch of files. Knowing whether the latest code fixes the leak you're seeing will help us figure out if it's something we've already dealt with or a previously unseen issue that we need to fix.

x5w46fxdx commented 3 years ago

The 0.4.0-SNAPSHOT GPU version cannot use the GPU; no GPU is detected... I tried CUDA Toolkit >= 10.2 and cuDNN >= 8.0.4.30.

NeverSayXz commented 3 years ago

Our team has met a similar problem: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (42164M) > maxPhysicalBytes (32768M). This OOM error occurs after running our service (which loads SavedModels for inference) for several hours. 0.3.2 CPU version: no memory leak. 0.3.2 GPU version: has a memory leak.