infrawatch / telemetry-framework

Telemetry Framework contains the installation tools for delivery of the Service Assurance Framework [Tech Preview]
Apache License 2.0

Router-to-router mode in performance test is very bad #96

Open csibbitt opened 4 years ago

csibbitt commented 4 years ago

When testing in "router-to-router mode" the SAF QDR fails to receive a large number of the metrics generated by telemetry-bench. In "direct mode" this is not observed.

r-to-r: telemetry-bench -> client QDR -> SAF QDR -> Smart Gateway
direct: telemetry-bench -> SAF QDR -> Smart Gateway
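For reference, the extra hop in r-to-r mode corresponds to an inter-router link between the client QDR and the SAF QDR. A minimal qdrouterd.conf sketch of the client-QDR side of that link might look like the following (the hostname is taken from the test deployment; the connector name and port are illustrative guesses, not the operator-generated config):

```
# Hypothetical client-QDR side of the inter-router link (sketch only).
connector {
    name: to-saf-qdr
    host: qdr-test.sa-telemetry.svc.cluster.local
    port: 55672           # assumed inter-router port; not from the real config
    role: inter-router
}
```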

Here is a comparison of r-to-r vs. direct mode.

Direct Mode (presettled)


r-to-r mode (presettled)


It was suggested that requiring acknowledgements ("unsettled delivery") might improve the situation. While this does improve reliability considerably (at the cost of some throughput and CPU utilization, though nothing major at this scale), some messages ARE still lost ("Received by QDR" < 3000010):

Direct Mode (unsettled)


r-to-r mode (unsettled)


Our best guess is that some QDR link/session settings need tuning, but currently the operator does not seem to expose them.
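For context, the kinds of settings we suspect are the per-listener flow-control knobs in qdrouterd. A hand-written sketch follows; the attribute names are standard qdrouterd listener options, but the values are illustrative guesses, not a recommendation, and none of this is currently exposed by the operator:

```
# Illustrative qdrouterd listener tuning (sketch; values are guesses).
listener {
    host: 0.0.0.0
    port: 5672
    role: normal
    linkCapacity: 1000      # deliveries buffered per link before back-pressure
    maxFrameSize: 16384     # bytes per AMQP transfer frame
    maxSessionFrames: 300   # incoming window: frames buffered per session
}
```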

csibbitt commented 4 years ago

Explore and fix cause of dropped messages in router-to-router link

csibbitt commented 4 years ago

To switch from "r-to-r mode" to "direct mode", change the AMQP URL from "qdr-test" to "qdr-white":

diff --git a/tests/performance-test/deploy/performance-test-job-tb.yml.template b/tests/performance-test/deploy/performance-test-job-tb.yml.template
index efca968..db3f595 100644
--- a/tests/performance-test/deploy/performance-test-job-tb.yml.template
+++ b/tests/performance-test/deploy/performance-test-job-tb.yml.template
@@ -17,7 +17,7 @@ spec:
         - name: performance-test
           image: quay.io/redhat-service-assurance/telemetry-bench
           imagePullPolicy: Always
-          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "amqp://qdr-test.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
+          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "amqp://qdr-white.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
       affinity:
         podAntiAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:

To switch from presettled to unsettled, add "-ack" to the command line:

diff --git a/tests/performance-test/deploy/performance-test-job-tb.yml.template b/tests/performance-test/deploy/performance-test-job-tb.yml.template
index efca968..ce8dfa6 100644
--- a/tests/performance-test/deploy/performance-test-job-tb.yml.template
+++ b/tests/performance-test/deploy/performance-test-job-tb.yml.template
@@ -17,7 +17,7 @@ spec:
         - name: performance-test
           image: quay.io/redhat-service-assurance/telemetry-bench
           imagePullPolicy: Always
-          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "amqp://qdr-test.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
+          args: ["-hostprefix", "<<PREFIX>>", "-hosts", "<<HOSTS>>", "-plugins", "<<PLUGINS>>", "-instances", "1", "-send", "<<COUNT>>", "-interval", "<<INTERVAL>>", "-startmetricenable", "-verbose", "-ack", "amqp://qdr-test.sa-telemetry.svc.cluster.local:5672/collectd/telemetry/"]
       affinity:
         podAntiAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:

pleimer commented 4 years ago

Thanks for posting this. I would like to add that the "maxFrameSize" property could influence throughput too, since it determines whether a large message fits in one frame or is split across multiple frames.

Of course, if we can set a large number of frames per session, this may not matter as much. It could also be that collectd plugins do not send very large messages.
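To make that interaction concrete: the per-session incoming buffer is roughly the product of the two settings, so a large frame count can compensate for a small frame size. The numbers below are illustrative only, not measurements from this deployment:

```
# Approximate per-session incoming window (sketch, illustrative numbers):
#   window_bytes ~= maxFrameSize * maxSessionFrames
#   e.g. 16384 B * 100 frames ~= 1.6 MB buffered per session
# A single collectd metric message is typically well under one 16 KiB
# frame, so maxFrameSize would only matter for unusually large payloads.
```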