thingsboard / thingsboard-edge

Apache License 2.0

[Question] Message queue accumulation data #47

Closed WingedKitten closed 1 year ago

WingedKitten commented 1 year ago

Component

After data enters the message queue, it seems to accumulate in only one queue.

WingedKitten commented 1 year ago

I tried connecting 200 devices, each reporting data every 5 seconds, and the message queue began to accumulate data. I checked the server resources: memory, CPU, and disk usage are all low. What is limiting the consumption speed on the edge? @volodymyr-babak

WingedKitten commented 1 year ago

With the same system configuration, ThingsBoard itself handles 200 connected devices with no message queue accumulation.

WingedKitten commented 1 year ago

@ashvayka @volodymyr-babak Could you take a look at this problem? Are we using it incorrectly because of something in the configuration file?

volodymyr-babak commented 1 year ago

@WingedKitten

please describe your setup: what message queue type is used for cloud and edge, what rule chain you are using, what device protocol is used to connect devices to the edge, etc. Please provide as much information as possible so we are able to understand and reproduce the issue. At the moment it's not clear what exactly you are trying to achieve and what data is accumulated in the queue. Edge does not use the queue to send data to the cloud, so messages accumulating in the queue point to failures in rule engine processing. Please take a look at the rule engine statistics on cloud and edge; you will probably find some error messages there.

WingedKitten commented 1 year ago

I use RabbitMQ as the message queue, and the rule chain is the default one. Devices connect over MQTT through TB Gateway. The server has 8 CPUs and 16 GB of memory. I'm trying to run a performance test: 200 devices are connected at the same time, each sending data every 5 seconds. But the edge can't consume this data quickly enough. With the same method and devices connected to TB cloud, there is no pressure. I also checked the edge server resources: CPU usage is only about 2400 MHz, and memory usage is 983 MB.
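
As a rough sanity check on the load described here, the aggregate message rate can be worked out with shell arithmetic. This is only a sketch: it assumes every device sends exactly one message per reporting interval, which the thread does not confirm, and the variable names are illustrative.

```shell
# Assumption: each device sends one message per reporting interval.
interval_s=5

# Aggregate rate (messages per second) for the two device counts mentioned in the thread.
rate_200=$((200 / interval_s))
rate_1000=$((1000 / interval_s))

echo "200 devices:  ${rate_200} msg/s"
echo "1000 devices: ${rate_1000} msg/s"
```

Under that assumption the test load is in the range of tens to a few hundred messages per second.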

WingedKitten commented 1 year ago

So I don't understand why. Is the problem in the configuration file?

WingedKitten commented 1 year ago

The ThingsBoard and Edge versions are both 3.4.3.

WingedKitten commented 1 year ago

I was hoping to test 1000 devices, but the message queue began to pile up at around 50.

WingedKitten commented 1 year ago

image This is the disk activity of the database server that the edge connects to during the stress test. There does not seem to be any I/O pressure.

WingedKitten commented 1 year ago

The edge database is PostgreSQL 14.4; TB cloud uses PostgreSQL 14.4 + Cassandra. I thought the database was the cause, but the disk pressure seems to be very small. The edge does not use all of its resources. image

WingedKitten commented 1 year ago

@volodymyr-babak

volodymyr-babak commented 1 year ago

@WingedKitten

please provide more details on what you mean by the edge not being able to consume this data quickly. What is consumption in your case? Why do you think consumption is not good enough? Are you referring to the number of messages accumulated in the RabbitMQ queue? Did you check the rule engine statistics ('Api usage' menu)? Do you see any problems there? At the moment I don't understand what your issue is. Please provide more details, with screenshots from the TB components, showing what the slow consumption rate looks like in your case.

WingedKitten commented 1 year ago

image image image image image The number of messages accumulated in RabbitMQ keeps growing, and the telemetry latency keeps increasing. @volodymyr-babak

WingedKitten commented 1 year ago

I have been sending data for an hour, but the telemetry shown is still from 30 minutes ago.

WingedKitten commented 1 year ago

'Hourly telemetry persistence' is only half of 'The amount of data transferred per hour'. Does that mean inserting data into the database is too slow? @volodymyr-babak But the database disk usage is not high.

volodymyr-babak commented 1 year ago

@WingedKitten

I see that you have JS executions in your API Usage statistics, but in the previous post you mentioned that you have default rule chains. Default rule chains do not use JS executions. Please provide configuration of your rule chains.

WingedKitten commented 1 year ago

That was from the first test in the afternoon, which used a rule chain I had modified. Later, to rule out that variable, I switched back to the default chain; you can see there are no JS executions in the following hours. @volodymyr-babak image

volodymyr-babak commented 1 year ago

@WingedKitten that makes sense, thanks for clarification. Please use in-memory queue for testing, instead of RabbitMQ, and let me know if you have any delays in message transfer to the cloud in this case. Additionally, please attach thingsboard.log and tb-edge.log files to the ticket.
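
A minimal sketch of the suggested switch, assuming the edge reads the same `TB_QUEUE_TYPE` variable as the ThingsBoard server and that environment overrides are placed in `tb-edge.conf`. Both points are assumptions to verify against your installation.

```shell
# tb-edge.conf (assumed location of environment overrides)
# Assumption: "in-memory" is the value that selects the built-in queue.
export TB_QUEUE_TYPE=in-memory
```

The edge service would need to be restarted for the change to take effect.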

WingedKitten commented 1 year ago

After switching to the in-memory queue there is no delay, thanks for your help. Why did the delay appear when using RabbitMQ? @volodymyr-babak

volodymyr-babak commented 1 year ago

@WingedKitten it is hard to answer without a deep investigation of the problem. I see at least two possible issues:

  1. a bug in the RabbitMQ queue implementation in the ThingsBoard product.
  2. the consumption rate of messages from RabbitMQ is low and must be increased to avoid delays.

I suggest using the Kafka queue instead of RabbitMQ. The performance of the Kafka queue type has already been confirmed by many different customers, and TB is able to process a lot of messages using Kafka as the queue.
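
For illustration, a Kafka setup might look like the fragment below. It assumes the edge honours the same `TB_QUEUE_TYPE` and `TB_KAFKA_SERVERS` variables as the ThingsBoard server configuration, and `localhost:9092` is only a placeholder broker address.

```shell
# Assumptions: variable names follow the ThingsBoard server configuration;
# localhost:9092 is a placeholder for the real Kafka broker address.
export TB_QUEUE_TYPE=kafka
export TB_KAFKA_SERVERS=localhost:9092
```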

WingedKitten commented 1 year ago

My TB cloud also uses RabbitMQ with 1000 devices connected, and it has no delay problems. Only the edge shows this problem. @volodymyr-babak

WingedKitten commented 1 year ago

Hi, I found a problem that affects the synchronization speed. When the amount of data in the cloud_event table grows to a certain size, the synchronization speed between edge and TB cloud drops. When I truncate the cloud_event table, the synchronization speed recovers. I guess the growing amount of data in cloud_event slows down the queries. I ran a stress test with 1000 devices last night, and in the morning I found that the cloud_event table held 7.8 GB of data. image The transmission speed between edge and TB cloud began to decline at about 4 o'clock, and the database CPU utilization began to rise sharply at the same time. This is the network usage: image
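
The table growth described here can be checked directly in PostgreSQL. The helper below is a sketch, not part of ThingsBoard: the function name and the `DB_USER`, `DB_HOST` and `DB_NAME` variables are illustrative, and the default database name `tb-edge` is taken from the `pg_dump` command quoted in this thread.

```shell
# Print the on-disk size and row count of the cloud_event table.
# DB_USER, DB_HOST and DB_NAME are illustrative names with placeholder defaults.
cloud_event_stats() {
  psql -U "${DB_USER:-postgres}" -h "${DB_HOST:-localhost}" -d "${DB_NAME:-tb-edge}" \
    -c "SELECT pg_size_pretty(pg_total_relation_size('cloud_event')) AS total_size,
               count(*) AS row_count
        FROM cloud_event;"
}
```

Running `cloud_event_stats` before and after a truncate shows how the table size relates to the slowdown. Note that `count(*)` scans the table and can itself be slow on a multi-gigabyte table.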

WingedKitten commented 1 year ago

I am now trying to shorten the TTL, but frequent deletes do not seem to be a good approach. Could the table be partitioned instead?

volodymyr-babak commented 1 year ago

@WingedKitten

thanks for the additional information. I have created a new ticket to partition the cloud_event table: https://github.com/thingsboard/thingsboard-edge/issues/49

Additionally, I'll try to see what could be improved for your current test case until partition of cloud_event is done - probably we can add some more indexes for this table to make it more efficient in case of many records. I'll get back into this thread once we have some progress on this.

volodymyr-babak commented 1 year ago

@WingedKitten

Can you please dump your edge database and share it with me? Please create the dump using the following command:

pg_dump -v -U <YOUR_DB_USER> -h <DB_IP_ADDRESS> -W -Fc tb-edge > tb-edge.dump

And please upload this file to Google Drive or some other public resource and provide the link here. I would like to restore this DB locally and make sure that creating additional indexes has the intended effect in your case.
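
For reference, a custom-format (`-Fc`) dump like this one is restored with `pg_restore` rather than `psql`. The function below is a hedged sketch of such a local restore; the function name, `LOCAL_USER`, `LOCAL_DB` and the default database name `tb_edge_copy` are all hypothetical.

```shell
# Restore a custom-format (-Fc) dump into a fresh local database.
# LOCAL_USER and LOCAL_DB are illustrative names, not values from the thread.
restore_edge_dump() {
  dump_file="${1:-tb-edge.dump}"
  # Create an empty target database, then load the dump into it.
  createdb -U "${LOCAL_USER:-postgres}" "${LOCAL_DB:-tb_edge_copy}" &&
    pg_restore -v -U "${LOCAL_USER:-postgres}" -d "${LOCAL_DB:-tb_edge_copy}" "$dump_file"
}
```

It would be called as `restore_edge_dump tb-edge.dump` on a machine with the PostgreSQL client tools installed.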

WingedKitten commented 1 year ago

Sorry, I have been busy with work. Because I truncated the test table, what I can try to provide is an earlier 9.9 GB copy of the cloud_event table data. You can load it into your own table to simulate cloud_event growing to a certain size and check whether partitioning has an effect. Can I send it to you by email?

volodymyr-babak commented 1 year ago

Hi @WingedKitten,

I wanted to let you know that we partitioned the cloud event table by created_time in this commit: https://github.com/thingsboard/thingsboard-edge/commit/30df42d784e34fae41794601cbc20daa91fed7a8.

During our testing phase, we pushed 2000 messages per second for 1000 devices on the edge for 3 hours, and during these 3 hours 22 million cloud events were created. There was no lag between the cloud and the edge. At the moment, the size of the cloud_event table is 7 GB.
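
The reported figures can be cross-checked with simple arithmetic. This sketch assumes the 2000 messages per second rate was sustained for the full 3 hours and that each message produces roughly one cloud event; neither is stated explicitly in the thread.

```shell
# Figures taken from the test description; variable names are illustrative.
rate_per_s=2000
duration_h=3
devices=1000

# Total messages over the test window, and the average rate per device.
total_messages=$((rate_per_s * duration_h * 3600))
per_device_rate=$((rate_per_s / devices))

echo "total messages: ${total_messages}"
echo "per-device rate: ${per_device_rate} msg/s"
```

The total comes to 21.6 million messages, which is in line with the roughly 22 million cloud events reported.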

We achieved processing speeds that were 20-50 times faster than before. One parameter that we tuned on the edge side was CLOUD_RPC_STORAGE_MAX_READ_RECORDS_COUNT. You can find this parameter in tb-edge.yml file and overwrite it using tb-edge.conf file. We used 5000 instead of the default value of 50 during the testing phase to fetch more records from the edge database during a single SELECT query.
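
As a sketch of the override described above, assuming `tb-edge.conf` uses shell-style `export` lines for environment variables (check the existing entries in your file for the exact syntax):

```shell
# tb-edge.conf
# Value used by the maintainers during testing; the default is 50.
export CLOUD_RPC_STORAGE_MAX_READ_RECORDS_COUNT=5000
```

The edge service would need to be restarted to pick up the new value.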

If you encounter any further issues related to the performance of sending cloud events on the 3.5 release, feel free to reopen this ticket. Otherwise, I'll go ahead and close it.