thingsboard / performance-tests

Thingsboard performance tests
https://thingsboard.io
Apache License 2.0

yet another thingsboard performance test similar scenario #66

Closed chienfuchen32 closed 4 months ago

chienfuchen32 commented 10 months ago

Hi, @smatvienko-tb, thanks for sharing your awesome perf post. I'd like to feed back a similar-scenario perf test here: thingsboard-performance-test. It might not be as good as your write-up; this post mostly focuses on rule chain and message queue scalability on ThingsBoard v3.5.1. If there is any configuration I haven't mentioned, please feel free to reach me anytime. Thank you.

smatvienko-tb commented 10 months ago

Hi, @chienfuchen32. Thank you for your feedback and for your effort testing the ThingsBoard cluster on Azure. My article on cluster performance has not yet been released to share the insights on cluster performance tuning.

In general, I don't see enough data to analyze the performance. I see a uniform message processing rate in the rule engine, but it is not clear whether any lag is left in the queue. The message counts do not look uniform, but there might be a reason for that. Please check the telemetry timestamps that are actually saved in the database; you may find that the intervals are not exactly 60 seconds. If you check the source code of the TbMsgCountNode, you will see that the node works asynchronously and makes a best effort at 60-second periods, plus the time spent waiting in the rule engine mailbox and the rule node execution time. So the periods may not be uniform: they depend on the availability of the rule-dispatcher executor and on how busy the actorSystem.scheduler is. The rule engine stats, on the other hand, are designed to run periodically in a scheduler thread pool separate from the rule engine. So the message-count approach is not the same as the one I suggested in my performance test. Extending the count period will help to report numbers with less fluctuation.

Still, please try to count messages outside ThingsBoard, using the Grafana dashboard for Kafka, to get data that is not produced by ThingsBoard itself. Third-party tools can help you investigate issues as well.

If you would like to unlock the full performance available, please find the bottlenecks under your load and adjust your setup accordingly. In most cases, the bottleneck is database I/O or N^M complexity in some custom rule-chain logic. You can use the stats output to find out where the I/O queue lag is building up. Another good instrument is JMX, which can show you the internals of the thread pools, the Kafka producer MBean, and the Redis pool-related MBeans as well.

Wishing you the best results!
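To make the timestamp check concrete, here is a minimal Python sketch (the timestamp and count values are hypothetical) showing why dividing a raw count by a nominal 60 s period skews the reported rate, and why normalizing by the actual interval between saved timestamps removes that skew:

```python
# Hypothetical epoch-millisecond timestamps, as a best-effort 60 s
# schedule might actually save them: the intervals drift around 60 s.
saved_ts = [1700000000000, 1700000061200, 1700000120300, 1700000181800]

# Interval between consecutive saved points, in seconds.
intervals = [(b - a) / 1000.0 for a, b in zip(saved_ts, saved_ts[1:])]
print(intervals)  # [61.2, 59.1, 61.5] -- not exactly 60.0

# Hypothetical message counts observed in each interval.
counts = [6100, 5900, 6150]

# Normalizing by the real interval gives a steadier per-second rate
# than assuming every period was exactly 60 seconds long.
rates = [c / dt for c, dt in zip(counts, intervals)]
print([round(r, 1) for r in rates])  # [99.7, 99.8, 100.0]
```

The point is only that the fluctuation you see in per-period counts can come from the period length itself, not from the processing rate.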

chienfuchen32 commented 10 months ago

Hi, @smatvienko-tb. I'll try to follow the things you mentioned first: VisualVM, the Kafka exporter, and Grafana, and check the performance and metrics insights of the third-party software, like the Azure managed database and Redis, to see if I can go deeper into the whole system. I hope to give you some feedback soon. Thanks for your advice.
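For the lag check discussed above, the quantity a Kafka exporter feeds into Grafana is simple: per-partition lag is the partition's end offset minus the consumer group's committed offset. A tiny Python sketch with hypothetical offset numbers:

```python
# Hypothetical per-partition offsets for a rule-engine consumer group.
end_offsets = {0: 125000, 1: 124800, 2: 125300}  # latest offset per partition
committed = {0: 124950, 1: 124800, 2: 123100}    # group's committed offsets

# Lag = messages produced but not yet consumed, per partition.
lag = {p: end_offsets[p] - committed[p] for p in end_offsets}
print(lag)               # {0: 50, 1: 0, 2: 2200}
print(sum(lag.values())) # 2250

# A lag that keeps growing over time (here, partition 2) means the
# consumers cannot keep up with the producers on that partition.
```

Because these offsets come from the broker, the measurement is independent of anything ThingsBoard reports about itself, which is the point of counting messages outside the system under test.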