hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu
https://www.hyperledger.org/projects/besu
Apache License 2.0
1.5k stars 824 forks source link

besu-vm in (batch mode) memory leak #5087

Closed holiman closed 1 year ago

holiman commented 1 year ago

If I use besu-vm in batch-mode for a couple of hours, or ~15-20K testcases, it eventually runs out of memory.

Some chosen portions of the crash log:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
#   JVM is running with Zero Based Compressed Oops mode in which the Java heap is
#     placed in the first 32GB address space. The Java Heap base address is the
#     maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
#     to set the Java Heap base and to place the Java Heap above 32GB virtual address.
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2790), pid=3592000, tid=3592042
#
# JRE version: OpenJDK Runtime Environment (17.0.5+8) (build 17.0.5+8-Ubuntu-2ubuntu122.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.5+8-Ubuntu-2ubuntu122.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/martin/workspace/goevmlab/cmd/generic-fuzzer/core.3592000)
#

---------------  S U M M A R Y ------------

Command Line: -Dsecp256k1.randomize=false org.hyperledger.besu.evmtool.EvmTool --nomemory --json state-test

Host: Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz, 4 cores, 31G, Ubuntu 22.04.1 LTS
Time: Fri Feb 10 20:31:39 2023 CET elapsed time: 6414.515872 seconds (0d 1h 46m 54s)

---------------  T H R E A D  ---------------

Current thread (0x00007f1c5013de00):  JavaThread "C1 CompilerThread0" daemon [_thread_in_vm, id=3592042, stack(0x00007f1c24200000,0x00007f1c24300000)]

Current CompileTask:
C1:6414515 8487       3       java.util.Collections::newSetFromMap (9 bytes)

Stack: [0x00007f1c24200000,0x00007f1c24300000],  sp=0x00007f1c242fde10,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

[...]

Threads with active compile tasks:
C1 CompilerThread0  6414516 8487       3       java.util.Collections::newSetFromMap (9 bytes)

VM state: not at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread:  ([mutex/lock_event])
[0x00007f1c50015d90] Metaspace_lock - owner thread: 0x00007f1c5013de00
[0x00007f1c50017d90] MethodData_lock - owner thread: 0x00007f1c5013de00

Heap address: 0x000000060ec00000, size: 7956 MB, Compressed Oops mode: Zero based, Oop shift amount: 3

CDS archive(s) mapped at: [0x0000000800000000-0x0000000800be3000-0x0000000800be3000), size 12464128, SharedBaseAddress: 0x0000000800000000, ArchiveRelocationMode: 0.
Compressed class space mapped at: 0x0000000800c00000-0x0000000840c00000, reserved size: 1073741824
Narrow klass base: 0x0000000800000000, Narrow klass shift: 0, Narrow klass range: 0x100000000

GC Precious Log:
 CPUs: 4 total, 4 available
 Memory: 31818M
 Large Page Support: Disabled
 NUMA Support: Disabled
 Compressed Oops: Enabled (Zero based)
 Heap Region Size: 4M
 Heap Min Capacity: 8M
 Heap Initial Capacity: 500M
 Heap Max Capacity: 7956M
 Pre-touch: Disabled
 Parallel Workers: 4
 Concurrent Workers: 1
 Concurrent Refinement Workers: 4
 Periodic GC: Disabled

[...]

Process Memory:
Virtual Size: 48176992K (peak: 48181252K)
Resident Set Size: 2621048K (peak: 2621048K) (anon: 591164K, file: 2029884K, shmem: 0K)
Swapped out: 564K
C-Heap outstanding allocations: 61960K, retained: 69263K
glibc malloc tunables: (default)

/proc/sys/kernel/threads-max (system-wide limit on the number of threads): 253998
/proc/sys/vm/max_map_count (maximum number of memory map areas a process may have): 65530
/proc/sys/kernel/pid_max (system-wide limit on number of process identifiers): 4194304

container (cgroup) information:
container_type: cgroupv2
cpu_cpuset_cpus: not supported
cpu_memory_nodes: not supported
active_processor_count: 4
cpu_quota: not supported
cpu_period: not supported
cpu_shares: not supported
memory_limit_in_bytes: unlimited
memory_and_swap_limit_in_bytes: unlimited
memory_soft_limit_in_bytes: unlimited
memory_usage_in_bytes: 28915544 k
memory_max_usage_in_bytes: not supported
maximum number of tasks: unlimited
current number of tasks: 64

Steal ticks since vm start: 0
Steal ticks percentage since vm start:  0,000

CPU: total 4 (initial active 4) (2 cores per cpu, 2 threads per core) family 6 model 142 stepping 12 microcode 0xc6, cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, lzcnt, tsc, tscinvbit, avx, avx2, aes, erms, clmul, bmi1, bmi2, adx, fma, vzeroupper, clflush, clflushopt
CPU Model and flags from /proc/cpuinfo:
model name  : Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Online cpus: 0-3
Offline cpus: 
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 0
Available cpu frequencies: <Not Available>
Current governor: powersave
Core performance/turbo boost: <Not Available>

Memory: 4k page, physical 32582572k(306868k free), swap 2097148k(1940476k free)
Page Sizes: 4k

vm_info: OpenJDK 64-Bit Server VM (17.0.5+8-Ubuntu-2ubuntu122.04) for linux-amd64 JRE (17.0.5+8-Ubuntu-2ubuntu122.04), built on Oct 24 2022 13:04:44 by "buildd" with gcc 11.2.0

THis happens with commit 56ae1fcf5b59ccf1f4a0855133110e53354fec38 -- previously it worked fine, in december, so I think it's some memory leak that's been created between mid december and now.

I'll try to repro it better, with a bit of luck I can just feed the same testcase over and over in a loop and trigger it, and do a git bisect.

Invocation is

besu-vm  --nomemory --json state-test 

and feeding filenames via stdin.

holiman commented 1 year ago

It can be reproed within a few minutes by:

 yes /home/martin/workspace/goevmlab/evms/testdata/cases/00000006-naivefuzz-0.json | /home/martin/workspace/besu-vm state-test --nomemory

When it is 'bad', the mem consumption keeps rising

$ while [ true ]; do ps -o rss= `pidof java` | awk '{printf "%.0f MB\n", $1 / 1024}'; sleep 1; done
32 MB
137 MB
166 MB
175 MB
175 MB
220 MB
247 MB
257 MB
272 MB
286 MB
303 MB
320 MB
338 MB
351 MB
369 MB
551 MB
565 MB
583 MB
604 MB
623 MB
643 MB
671 MB
689 MB
704 MB
719 MB
720 MB
734 MB
748 MB
766 MB
782 MB
798 MB
815 MB
830 MB
841 MB
856 MB
870 MB
886 MB
899 MB
913 MB
928 MB
942 MB
961 MB
979 MB

Whereas the 'good' becomes stable somewhere below 600 MB:

$ while [ true ]; do ps -o rss= `pidof java` | awk '{printf "%.0f MB\n", $1 / 1024}'; sleep 1; done
331 MB
415 MB
481 MB
563 MB
560 MB
561 MB
563 MB
576 MB
578 MB
578 MB
578 MB
578 MB
583 MB
583 MB
583 MB
583 MB
584 MB
577 MB
577 MB
577 MB
577 MB
577 MB
579 MB
579 MB
579 MB
579 MB
581 MB
575 MB
575 MB
575 MB
575 MB
575 MB
575 MB
575 MB
575 MB
575 MB

So, according to git bisect, the culprit is...

f8a0f874057a87e0f7ef864c9f289b39d370c0ae is the first bad commit
commit f8a0f874057a87e0f7ef864c9f289b39d370c0ae
Author: Justin Florentine <justin+github@florentine.us>
Date:   Mon Jan 23 09:01:39 2023 -0500

    KZG point eval precompile (#4860)

    * KZG implementation

    Signed-off-by: Justin Florentine <justin+github@florentine.us>

So https://github.com/hyperledger/besu/commit/f8a0f874057a87e0f7ef864c9f289b39d370c0ae ..

How, or why, I have no idea. I even extra-checked that commit and the one before, and it is indeed that commit which leaks memory.

macfarla commented 1 year ago

@jflo wanna take a look at this one?

holiman commented 1 year ago

It is loading this resource every time a test is executed: https://github.com/hyperledger/besu/commit/f8a0f874057a87e0f7ef864c9f289b39d370c0ae#diff-e614ff91ff732e88484a9e612a16f3bb91fd9889b3e06f0223cc33399e158fa9R52

jflo commented 1 year ago

yeah, @shemnon ran into this last week and has a fix for it pending: https://github.com/hyperledger/besu/pull/5030/commits/1aa8fb28f56a236404fbe661ce809adc4769f513. Will keep this issue open for retest after that PR hits main.

holiman commented 1 year ago

PR #5030 merged, so I assume this is fixed. I'll test later this evening, closing in the meanwhile