infrawatch / telemetry-framework

Telemetry Framework contains the installation tools for delivery of the Service Assurance Framework [Tech Preview]
Apache License 2.0

Router-to-router mode in performance test is very bad #96

Open csibbitt opened 4 years ago

csibbitt commented 4 years ago

When testing in "router-to-router mode" the SAF QDR fails to receive a large number of the metrics generated by telemetry-bench. In "direct mode" this is not observed.

r-to-r: telemetry-bench -> client QDR -> SAF QDR -> Smart Gateway
direct: telemetry-bench -> SAF QDR -> Smart Gateway
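For reference, the extra hop in r-to-r mode corresponds to an inter-router link between the client QDR and the SAF QDR. A minimal qdrouterd.conf sketch of the client-QDR side of that link might look like the following (the hostname is taken from the test deployment; the connector name and port are illustrative guesses, not the operator-generated config):

```
# Hypothetical client-QDR side of the inter-router link (sketch only).
connector {
    name: to-saf-qdr
    host: qdr-test.sa-telemetry.svc.cluster.local
    port: 55672           # assumed inter-router port; not from the real config
    role: inter-router
}
```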

Here is a comparison of r-to-r vs. direct mode.

Direct Mode (presettled)


r-to-r mode (presettled)


It was suggested that requiring acknowledgements ("unsettled delivery") might improve the situation. While this does improve reliability considerably (at the cost of some throughput and CPU utilization, though nothing major at this scale), some messages ARE still lost ("Received by QDR" < 3000010):

Direct Mode (unsettled)


r-to-r mode (unsettled)


Our best guess is that some QDR link/session settings need tuning, but currently the operator does not seem to expose them.
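For context, the kinds of settings we suspect are the per-listener flow-control knobs in qdrouterd. A hand-written sketch follows; the attribute names are standard qdrouterd listener options, but the values are illustrative guesses, not a recommendation, and none of this is currently exposed by the operator:

```
# Illustrative qdrouterd listener tuning (sketch; values are guesses).
listener {
    host: 0.0.0.0
    port: 5672
    role: normal
    linkCapacity: 1000      # deliveries buffered per link before back-pressure
    maxFrameSize: 16384     # bytes per AMQP transfer frame
    maxSessionFrames: 300   # incoming window: frames buffered per session
}
```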

csibbitt commented 4 years ago

Explore and fix cause of dropped messages in router-to-router link

csibbitt commented 4 years ago

To switch from "r-to-r mode" to "direct mode", change the AMQP URL from "qdr-test" to "qdr-white":

diff --git a/tests/performance-test/deploy/performance-test-job-tb.yml.template b/tests/performance-test/deploy/performance-test-job-tb.yml.template
index efca968..db3f595 100644
--- a/tests/performance-test/deploy/performance-test-job-tb.yml.template
+++ b/tests/performance-test/deploy/performance-test-job-tb.yml.template
@@ -17,7 +17,7 @@ spec:
         - name: performance-test
           image: quay.io/redhat-service-assurance/telemetry-bench
           imagePullPolicy: Always
-          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "amqp://qdr-test.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
+          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "amqp://qdr-white.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
       affinity:
         podAntiAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:

To switch from presettled to unsettled, add "-ack" to the command line:

diff --git a/tests/performance-test/deploy/performance-test-job-tb.yml.template b/tests/performance-test/deploy/performance-test-job-tb.yml.template
index efca968..ce8dfa6 100644
--- a/tests/performance-test/deploy/performance-test-job-tb.yml.template
+++ b/tests/performance-test/deploy/performance-test-job-tb.yml.template
@@ -17,7 +17,7 @@ spec:
         - name: performance-test
           image: quay.io/redhat-service-assurance/telemetry-bench
           imagePullPolicy: Always
-          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "amqp://qdr-test.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
+          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "-ack", "amqp://qdr-test.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
       affinity:
         podAntiAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:

pleimer commented 4 years ago

Thanks for posting this. I would like to add that the "maxFrameSize" property could influence throughput too, since it determines whether a large message fits in one frame or is split across multiple frames.

Of course, if we can set a large number of frames per session, this may not matter as much. It could also be that collectd plugins do not send very large messages.
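To make that interaction concrete: the per-session incoming buffer is roughly the product of the two settings, so a large frame count can compensate for a small frame size. The numbers below are illustrative only, not measurements from this deployment:

```
# Approximate per-session incoming window (sketch, illustrative numbers):
#   window_bytes ~= maxFrameSize * maxSessionFrames
#   e.g. 16384 B * 100 frames ~= 1.6 MB buffered per session
# A single collectd metric message is typically well under one 16 KiB
# frame, so maxFrameSize would only matter for unusually large payloads.
```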