Open holiman opened 3 weeks ago
Tried to repro it
root@a04949b121fe:/# yes "/fuzztmp/10573375-mixed-8.json" | /neth/nethtest --trace -m -x 2>/dev/null | grep "time" | xargs -L1 bash checkmem.sh nethtest
PID 456170 proc nethtest | Rss: 728460 kB | Pss: 1417619 kB | Shared Clean 7348 kB | Shared Dirty 40 kB | Private 732420 kB
PID 456170 proc nethtest | Rss: 751540 kB | Pss: 1464707 kB | Shared Clean 7348 kB | Shared Dirty 48 kB | Private 754788 kB
PID 456170 proc nethtest | Rss: 775672 kB | Pss: 1512089 kB | Shared Clean 7292 kB | Shared Dirty 32 kB | Private 778992 kB
PID 456170 proc nethtest | Rss: 799328 kB | Pss: 1559395 kB | Shared Clean 7360 kB | Shared Dirty 32 kB | Private 802492 kB
PID 456170 proc nethtest | Rss: 823460 kB | Pss: 1608071 kB | Shared Clean 7364 kB | Shared Dirty 32 kB | Private 826936 kB
PID 456170 proc nethtest | Rss: 847404 kB | Pss: 1658023 kB | Shared Clean 7356 kB | Shared Dirty 32 kB | Private 853024 kB
PID 456170 proc nethtest | Rss: 874820 kB | Pss: 1713117 kB | Shared Clean 7412 kB | Shared Dirty 28 kB | Private 880676 kB
After a couple of minutes:
PID 456170 proc nethtest | Rss: 4451112 kB | Pss: 8861466 kB | Shared Clean 7420 kB | Shared Dirty 28 kB | Private 4443600 kB
PID 456170 proc nethtest | Rss: 4451112 kB | Pss: 8861496 kB | Shared Clean 7436 kB | Shared Dirty 28 kB | Private 4443584 kB
A few minutes later it has gone down a bit again:
PID 456170 proc nethtest | Rss: 4329912 kB | Pss: 8619118 kB | Shared Clean 7540 kB | Shared Dirty 32 kB | Private 4322404 kB
PID 456170 proc nethtest | Rss: 4329912 kB | Pss: 8619137 kB | Shared Clean 7496 kB | Shared Dirty 32 kB | Private 4322384 kB
Seems to stabilize around here
PID 456170 proc nethtest | Rss: 4351724 kB | Pss: 8662759 kB | Shared Clean 7492 kB | Shared Dirty 36 kB | Private 4344196 kB
So yeah, no obvious, easily reproducible leak.
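To put a number on the growth in the samples above, the change in Rss between consecutive samples can be computed straight from the checkmem.sh output. This is a minimal sketch that assumes the exact output format shown above; `rss_deltas` is a hypothetical helper name, not part of the original script.

```shell
#!/bin/bash
# rss_deltas: read checkmem.sh output lines on stdin and print, for every
# sample after the first, the Rss value and its change since the previous one.
# Assumes the "PID <pid> proc <name> | Rss: <n> kB | ..." format shown above.
rss_deltas() {
  awk -F'|' '
    {
      # The second |-separated field looks like " Rss: 728460 kB "
      split($2, f, " ")
      rss = f[2]
      if (NR > 1) printf "Rss %d kB (delta %+d kB)\n", rss, rss - prev
      prev = rss
    }'
}
```

Usage would be something like `rss_deltas < checkmem.log`. On the first seven samples above, the deltas come out between 23080 kB and 27416 kB per sample.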
memchecker-script:
#!/bin/bash
# Invoked via xargs as: checkmem.sh nethtest <input line>.
# Without an input line there is nothing to sample.
if [[ -z "$2" ]]; then
    exit 0
fi
for pid in $(ps -ef | awk '{print $2}'); do
    if [[ -z "$pid" ]]; then
        continue
    fi
    # Skip the header row of the ps output
    if [[ $pid == "PID" ]]; then
        continue
    fi
    a=$(ps -p $pid -o comm=)
    if [[ $a != "nethtest" ]]; then
        continue
    fi
    if [ -f /proc/$pid/smaps ]; then
        rss=$(awk 'BEGIN {i=0} /^Rss/ {i = i + $2} END {print i}' /proc/$pid/smaps)
        # Match "Pss:" exactly. A bare /^Pss/ also matches Pss_Dirty on newer
        # kernels and roughly doubles the total (Pss can never exceed Rss).
        pss=$(awk 'BEGIN {i=0} /^Pss:/ {i = i + $2} END {print i}' /proc/$pid/smaps)
        sc=$(awk 'BEGIN {i=0} /^Shared_Clean/ {i = i + $2} END {print i}' /proc/$pid/smaps)
        sd=$(awk 'BEGIN {i=0} /^Shared_Dirty/ {i = i + $2} END {print i}' /proc/$pid/smaps)
        pc=$(awk 'BEGIN {i=0} /^Private_Clean/ {i = i + $2} END {print i}' /proc/$pid/smaps)
        pd=$(awk 'BEGIN {i=0} /^Private_Dirty/ {i = i + $2} END {print i}' /proc/$pid/smaps)
        echo "PID $pid proc $a | Rss: $rss kB | Pss: $pss kB | Shared Clean $sc kB | Shared Dirty $sd kB | Private $(($pd + $pc)) kB"
    fi
done
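The same totals can also be gathered in a single pass over the smaps file instead of six separate awk invocations. The following is a sketch, not the script that produced the numbers above: `smaps_summary` is a hypothetical name, and it takes the smaps path as an argument so it can be pointed at `/proc/<pid>/smaps` or at a saved copy. Field names are compared exactly, so `Pss:` is not mixed up with `Pss_Dirty:`.

```shell
#!/bin/bash
# smaps_summary FILE: sum selected per-mapping counters from a smaps file
# (e.g. /proc/<pid>/smaps) in one awk pass and print them in the same
# layout as checkmem.sh, minus the PID/proc prefix.
smaps_summary() {
  awk '
    $1 == "Rss:"           { rss  += $2 }
    $1 == "Pss:"           { pss  += $2 }
    $1 == "Shared_Clean:"  { sc   += $2 }
    $1 == "Shared_Dirty:"  { sd   += $2 }
    $1 == "Private_Clean:" { priv += $2 }
    $1 == "Private_Dirty:" { priv += $2 }
    END {
      printf "Rss: %d kB | Pss: %d kB | Shared Clean %d kB | Shared Dirty %d kB | Private %d kB\n",
             rss, pss, sc, sd, priv
    }' "$1"
}
```

For example, `smaps_summary /proc/456170/smaps` would report the process sampled above, provided the caller has permission to read that file.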
After running for a few hours on a server, nethtest is way larger than any of the others:
    PID USER  PR NI   VIRT   RES   SHR S  %CPU %MEM     TIME+ COMMAND
2537959 root  20  0 537.8g 13.8g 64148 R  53.3  5.5 213:47.46 nethtest
2537951 root  20  0  53.2g  1.6g 28348 S  54.6  0.6 113:57.21 java
2537955 root  20  0  11.0g  1.5g 41004 S 175.2  0.6      6,25 erigon_vm
cc @MarekM25 @LukaszRozmej any ideas?
We either have to reproduce it ourselves or, if you run into a similar situation again, could you grab a memory dump/snapshot for us? You can use https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html
A day later (same server), it has gone from 13.8g to 15.5g RES:
    PID USER  PR NI   VIRT   RES   SHR S  %CPU %MEM TIME+ COMMAND
2537959 root  20  0 538.6g 15.5g 64164 R  50.0  6.2 28,36 nethtest
2537951 root  20  0  55.0g  1.8g 28348 S  54.6  0.7 15,18 java
I'll try to dump it.... But, it's going to be a pretty damn large dump..?
Meh, getting that thing to work inside Docker on a remote server seems non-trivial:
root@fb2a64aee5d9:/# /fuzztmp/JetBrains.dotMemory.linux-x64.2024.2.7/linux-x64/dotMemory -h
Unhandled exception. System.Exception: XOpenDisplay failed
at Avalonia.X11.AvaloniaX11Platform.Initialize(X11PlatformOptions options) in Z:\BuildAgent\work\f216b9e13ea6fd05\Avalonia\src\Avalonia.X11\X11Platform.cs:line 55
at Avalonia.AvaloniaX11PlatformExtensions.<>c.<UseX11>b__0_0() in Z:\BuildAgent\work\f216b9e13ea6fd05\Avalonia\src\Avalonia.X11\X11Platform.cs:line 354
at Avalonia.AppBuilder.SetupUnsafe() in Z:\BuildAgent\work\f216b9e13ea6fd05\Avalonia\src\Avalonia.Controls\AppBuilder.cs:line 328
at Avalonia.AppBuilder.Setup() in Z:\BuildAgent\work\f216b9e13ea6fd05\Avalonia\src\Avalonia.Controls\AppBuilder.cs:line 316
We have a dockerfile with diagnostic tools: https://github.com/NethermindEth/nethermind/blob/master/Dockerfile.diag#L38
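If the .NET diagnostic CLI tools `dotnet-gcdump` and `dotnet-dump` are available in the container, a snapshot could be taken without any display, which sidesteps the XOpenDisplay failure above. Whether Dockerfile.diag installs these particular tools is an assumption here, not something confirmed in this thread. The sketch below uses a hypothetical helper `collect_dumps` and a `DRY_RUN` switch that belongs to the sketch only, not to the tools.

```shell
#!/bin/bash
# collect_dumps PID OUTDIR: take a GC heap snapshot and a full process dump
# of a running .NET process with the dotnet-gcdump and dotnet-dump CLI tools.
# Assumes both tools are installed and on PATH. With DRY_RUN=1 the commands
# are only printed, not executed.
collect_dumps() {
  local pid=$1
  local outdir=$2
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  # GC heap snapshot: covers the managed heap only, so it is typically far
  # smaller than a full dump. Collecting it triggers a garbage collection.
  $run dotnet-gcdump collect -p "$pid" -o "$outdir/nethtest-$pid.gcdump"
  # Full process dump: can be about as large as the process's memory.
  $run dotnet-dump collect -p "$pid" --type Full -o "$outdir/nethtest-$pid.dmp"
}
```

Running `DRY_RUN=1 collect_dumps 2537959 /fuzztmp` first shows the two commands without touching the process. For a 15.5g process, starting with the gcdump alone may be the more practical option.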
Ran into this crash with the fuzzer:
Somewhere after 42M invocations, the nethermind instance went OOM (the testcase was nothing special). Possibly nethermind in batch mode has some memory leak. TODO investigate.
File which potentially triggers the leak, 10573375-mixed-8.json: