author DaGeRe -- Mon, 22 May 2023 16:33:02 +0200
The problem can be reproduced much more quickly by restricting the execution to configurations 2, 3 and 4:
[info] Intermediary results No logging Average: 4971.03 Standard Deviation: 61681.6 Count: 2000000
[info] Intermediary results Text file Average: 83528.3 Standard Deviation: 74730.2 Count: 2000000
[info] Intermediary results Binary file Average: 3082.11 Standard Deviation: 62207.4 Count: 2000000
The behaviour also stays the same when reducing the number of calls and increasing the sleep time: export SLEEP_TIME=5; export NUM_OF_LOOPS=3; export TOTAL_NUM_OF_CALLS=1000000; ./benchmark.sh
To speed up the analysis, I’ve created a branch for this: https://github.com/kieker-monitoring/moobench/tree/examine-noLoggingBinaryFileDifference
The reason does not seem to lie in the GC, since the writing variant (4) actually has even more GC time (obtained with JAVA_ARGS="-XX:+PrintCompilation -verbose:gc"):
[reicheltlaptop-uni results]$ for file in "output_2.txt"; do cat $file | grep "GC"; done | awk '{print $5}' | tr "," "." | awk '{sum+=$1} END {print sum/NR}'
0.0100039
[reicheltlaptop-uni results]$ for file in "output_4.txt"; do cat $file | grep "GC"; done | awk '{print $5}' | tr "," "." | awk '{sum+=$1} END {print sum/NR}'
0.0200146
Therefore, we’ll need more factorial experiments and ideas for digging deeper (potentially using perf or other debugging tools).
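As a cross-check that does not depend on parsing the GC log format, jstat can report the cumulative GC time directly. A minimal sketch, assuming the benchmark JVM's PID is known (YGCT, FGCT and GCT are the young, full and total GC seconds):
# sample the GC counters of the running JVM five times, once per second
jstat -gc $PID 1000 5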
author DaGeRe -- Mon, 22 May 2023 17:35:42 +0200
After recording the data using
sudo perf record -F 500 -p 20315 -g -o perf4.data -- sleep 10
and analysis using
sudo perf script -i perf4.data | $FLAME/stackcollapse-perf.pl | $FLAME/flamegraph.pl > perf_flame.svg
the following graphs were created:
The behaviour is obviously significantly different, but I cannot explain it yet.
Additionally, this currently only works when attaching the recording to an already running process. When I start the recording directly via
perf record -- java $DEFAULT_JVM_OPTS $JAVA_OPTS $BENCHMARK_OPTS \
  -cp $CLASSPATH \
  moobench.benchmark.BenchmarkMain
the mapping cannot be reproduced (the Java methods show up only as addresses like “0x…”).
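The usual cause of such “0x…” frames is that perf lacks a symbol map for JIT-compiled methods. A sketch of the common workaround, assuming perf-map-agent (https://github.com/jvm-profiling-tools/perf-map-agent) is built under $PERF_MAP_AGENT; -XX:+PreserveFramePointer makes perf's frame-pointer-based stack walking produce complete Java stacks:
# start the benchmark JVM with frame pointers preserved
java -XX:+PreserveFramePointer $DEFAULT_JVM_OPTS $JAVA_OPTS $BENCHMARK_OPTS \
  -cp $CLASSPATH moobench.benchmark.BenchmarkMain &
PID=$!
# write /tmp/perf-$PID.map so that perf can resolve JIT-compiled method names
$PERF_MAP_AGENT/bin/create-java-perf-map.sh $PID
perf record -F 500 -p $PID -g -o perf.data -- sleep 10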
I cannot really explain why the flame graph with no logging is faster and even contains fewer method calls. We can discuss this tomorrow.
author DaGeRe -- Thu, 25 May 2023 17:29:08 +0200
I’ve now uploaded the code for generating the perf information: https://github.com/kieker-monitoring/moobench/commit/ce6c5da2f6b5675d82963119165bad21da19816b By running ./benchmark.sh in frameworks/Kieker-java, perf_2.data and perf_4.data are created, which contain the profiling data for configuration indices 2 and 4.
Afterwards, the flamegraph can be created (if $FLAME_HOME contains the flamegraph repo) with:
perf script -i perf_2.data | $FLAME_HOME/stackcollapse-perf.pl > out2.folded
cat out2.folded | $FLAME_HOME/flamegraph.pl > out2.svg
perf script -i perf_4.data | $FLAME_HOME/stackcollapse-perf.pl > out4.folded
cat out4.folded | $FLAME_HOME/flamegraph.pl > out4.svg
$FLAME_HOME/difffolded.pl out4.folded out2.folded | $FLAME_HOME/flamegraph.pl > diff.svg
This creates the following flamegraphs:
The diff file is the last one; it shows that MonitoringWriterThread.run became 10% faster in configuration 2 (DumpWriter), which is surprising, since it should be the other way round.
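To quantify this beyond eyeballing the SVG, the folded diff can be sorted by the per-stack sample delta. A sketch, assuming difffolded.pl emits each folded stack followed by the sample counts of the first and the second profile:
# show the ten stacks with the largest sample growth in out2 relative to out4
$FLAME_HOME/difffolded.pl out4.folded out2.folded \
  | awk '{ printf("%+d\t%s\n", $NF - $(NF-1), $0) }' \
  | sort -k1,1n | tail -n 10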
author DaGeRe -- Sun, 28 May 2023 19:46:16 +0200
After some more experimentation, it seems like there is a nearly constant or slightly increasing factor between no logging and binary writing:
(Measurements obtained via for depth in 2 10 20 50 75 100 200 300 400 500 600 750 1000; do export RECURSION_DEPTH=$depth; ./benchmark.sh &> $depth.txt; done; analysis via for depth in 2 10 20 50 75 100 200 300 400 500 600 750; do echo -n $depth" "; tail -n 2 $depth.txt | awk '{ if (NR%2 != 0) line=$7; else { printf("%s %s\n", line, $7); line=""; } }'; done.)
Unfortunately, depth 1000 cannot be executed successfully (probably the JVM runs short on RAM), but the trend seems clear enough from this experiment.
As the next experiment, I’ll try the same with TOTAL_NUM_OF_CALLS: for executions in 100 1000 100000 1000000 2000000 10000000; do export TOTAL_NUM_OF_CALLS=$executions; ./benchmark.sh &> num_$executions.txt; done. Maybe this yields additional insight.
If it does not, the only option imho is to execute factorial experiments with tweaked writers and see how these behave.
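For the analysis of those runs, the per-run factor between no logging and binary writing can be tabulated. A sketch, assuming each output file ends with the same two summary lines (means in column 7) as in the depth experiment:
for executions in 100 1000 100000 1000000 2000000 10000000; do
  echo -n $executions" "
  # print both means and their ratio
  tail -n 2 num_$executions.txt | awk '{ v[NR] = $7 } END { printf("%s %s %.3f\n", v[1], v[2], v[1] / v[2]) }'
done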
author DaGeRe -- Tue, 6 Jun 2023 10:00:24 +0200
Next step: add a loop of additions to the DumpWriter, to find out at which artificial overhead (number of additions) the DumpWriter and the FileWriter are equally fast.
Additionally: examine context switches (a perf sketch follows below): https://unix.stackexchange.com/questions/259710/understanding-linux-perf-sched-switch-and-context-switches
Problem for MooBench: distinguishing between instrumentation overhead, measurement overhead and writing overhead is no longer possible if the overheads are not additive.
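A sketch for the context-switch examination, assuming the benchmark JVM's PID is known (context-switches and cpu-migrations are standard perf software events):
# count context switches and CPU migrations of the JVM over a 10 s window
perf stat -e context-switches,cpu-migrations -p $PID -- sleep 10
# alternatively, record scheduler events for a per-task latency breakdown
sudo perf sched record -p $PID -- sleep 10
sudo perf sched latency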
author DaGeRe -- Fri, 9 Jun 2023 18:30:54 +0200
First results show that this really works:
This was obtained using https://github.com/kieker-monitoring/kieker/tree/KIEKER-1958-examinationLogging , then copying the jar into the MooBench folder and executing export SLEEP_TIME=1; for i in {1..100}; do export DUMMY_LOOPS=$i; ./benchmark.sh &> $i.txt; done
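To locate the crossover point automatically, one can print the first DUMMY_LOOPS value at which the dummy-loop configuration becomes slower than the file writer. A sketch, assuming the two summary lines at the end of each output file appear in that order (means in column 7):
for i in $(seq 1 100); do
  tail -n 2 $i.txt | awk -v i=$i '{ v[NR] = $7 } END { if (v[1] + 0 > v[2] + 0) print i }'
done | head -n 1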
With more loops, the results should become a bit more stable. For the paper, I’ll set this up in a reproducible way.
This is a clear indicator that our theory that lock contention caused the problem was correct. The next question is: what would a more realistic benchmark that measures but does not write look like?
author DaGeRe -- Tue, 6 Aug 2024 10:47:20 +0200
As a result of this, the DumpQueue has been introduced to Kieker (https://github.com/kieker-monitoring/kieker/blob/main/monitoring/core/src/kieker/monitoring/writer/dump/DumpQueue.java), and the results have been published in Reichelt, David Georg, Reiner Jung, and André van Hoorn: "More is Less in Kieker? The Paradox of No Logging Being Slower Than Logging" (2023). Therefore, I am deleting the stale branch KIEKER-1958-examinationLogging. The branch contained the DumpQueue, which has been merged to main, and the option for dummy loops in the DumpWriter, which is not necessary for production (while the DumpQueue, as the final solution, is required in Kieker releases for MooBench's factorial experiments).
JIRA Issue: KIEKER-1958 Examination of No Logging vs. Binary TCP
Original Reporter: David Georg Reichelt
MooBench currently has lower execution times for the binary and the TCP writer than for no writer at all (DumpWriter), e.g.
/home/reichelt/workspaces/dissworkspace/moobench/frameworks/Kieker-java/results/raw-10-10-5.csv
              1          2           3            4               5               6
mean      0.1676     2.0350     26.2462     587.5731        23.7804        17.1894
sd        0.1352     7.6923     11.9931    3018.1909      4255.7879        19.1758
ci95%     0.0003     0.0151      0.0235       5.9155         8.3412         0.0376
md25%     0.1660     1.4080     24.8880     560.4000         7.8510         9.4620
md50%     0.1670     1.5560     25.4070     563.6040         8.1110        18.6290
md75%     0.1670     1.7590     26.1470     584.3260         8.5180        20.8520
max      72.7590  3926.2580   5725.3260 1099142.1380   2268085.0420      9803.0360
min       0.1480     1.1660      8.2400       8.5560         7.3330         7.5000
(on Raspberry Pi 4) or
kind: Kieker-java
experiments:
- timestamp: 1682336292.51602
  measurements:
    No instrumentation : [ 1.115724e-01 , 4.14653e-02 , 8.12705e-05 , 1.1e-01 , 1.11e-01 , 1.12e-01 , 2.9207e+01 , 1.07e-01 ]
    Deactivated probe : [ 6.411729e-01 , 5.078517e+00 , 9.953711e-03 , 6.02e-01 , 6.09e-01 , 6.42e-01 , 3.049152e+03 , 5.83e-01 ]
    No logging : [ 1.402871e+01 , 1.864584e+01 , 3.654517e-02 , 1.2776e+01 , 1.4383e+01 , 1.4868e+01 , 1.096994e+04 , 4.199e+00 ]
    Text file : [ 3.224599e+02 , 6.532146e+01 , 1.280277e-01 , 3.17507e+02 , 3.19641e+02 , 3.2291e+02 , 1.22996e+04 , 4.782e+00 ]
    Binary file : [ 6.014899e+00 , 1.758702e+01 , 3.446992e-02 , 4.303e+00 , 4.754e+00 , 7.965e+00 , 1.052224e+04 , 3.781e+00 ]
    Binary TCP : [ 9.388779e+00 , 8.308957e+00 , 1.628526e-02 , 8.61e+00 , 8.852e+00 , 9.38e+00 , 4.213459e+03 , 3.98e+00 ]
(from node3, Kiel server for MooBench experiments).
This should be examined. If there is a reasonable explanation for this behaviour, it should be added to https://kieker-monitoring.net/performance-benchmarks/