thingsboard / thingsboard-edge

Apache License 2.0
[Question] ThingsBoard Edge PE disconnects from cloud #57

Open akseerali opened 1 year ago

akseerali commented 1 year ago

Component

Description I am using a ThingsBoard PE Perpetual license with ThingsBoard Cloud Maker. The issue is that the Edge status is shown as offline on the cloud. Below are some of the symptoms I have observed so far:

Below are the screenshots of Edge activity status from Cloud and Edge.

Questions

Environment

volodymyr-babak commented 1 year ago

Hello @akseerali,

To fully understand the issue you're experiencing, we would need some additional information. Could you please provide the complete log from your ThingsBoard Edge container? Additionally, if you could attach your docker-compose.yml file, it would be very helpful.

This additional information is crucial because, without a comprehensive log analysis, determining the root cause of your problem is challenging. Thank you in advance for your cooperation!

akseerali commented 1 year ago

Hi @volodymyr-babak

Please find attached the docker-compose configuration and edge log file.

tb-edge.log docker-compose.txt

akseerali commented 1 year ago

Hi @volodymyr-babak, I have observed another issue that might be related to this problem. Today the Cloud is unable to send the RPC requests to Devices connected to Edge even though the edge is connected. The rule chain message shows "NO_ACTIVE_CONNECTION". Please see the screenshot below.

image

I have tried to unassign and then assign all the users to Edge, but the issue persists. Link

volodymyr-babak commented 1 year ago

@akseerali

Could you please check whether you see an RPC Call event in the Downlinks tab of the Edge entity? image

akseerali commented 1 year ago

Hi @volodymyr-babak

No, it's not showing RPC call in the Downlinks section.

image

volodymyr-babak commented 1 year ago

@akseerali

It seems like you're using the cloud version of ThingsBoard along with a ThingsBoard PE Edge license. As such, you should have access to our ThingsBoard Customer Portal, available at https://thingsboard-portal.atlassian.net/browse/CP.

As the troubleshooting of this issue may require additional private information from you, I would suggest continuing our investigation on this closed portal to ensure your data privacy.

Please note that if the root of the issue turns out to be a bug in our platform, we will update this GitHub ticket with that information. This way, our broader user community can also benefit from the findings of our investigation.

Looking forward to assisting you further on the ThingsBoard Customer Portal.

akseerali commented 1 year ago

Thanks a lot @volodymyr-babak for the support. Our team will now go with the Customer Portal. Please note that the docker-compose file I attached earlier included an additional volumes section by mistake. Please find attached the docker-compose configuration actually used for the setup.

Extra configuration volumes: /media/iiotedge/sshd/tb-edge/.mytb-edge-logs:

docker compose updated.txt

volodymyr-babak commented 1 year ago

@akseerali

Thank you for providing the updated docker-compose file and the previous logs. I've reviewed the information, but the root cause of the disconnection issue is not immediately clear to me.

However, it's possible that the disconnections may be related to an issue that we've recently addressed and fixed in our latest release: https://github.com/thingsboard/thingsboard/pull/8346

We just updated our cloud to the 3.5 release yesterday, and the 3.5 Edge version will be publicly available today. We'll also update the documentation on our website accordingly.

Once these updates are live, I would kindly ask you to upgrade your version to 3.5.0 and monitor the behavior. If my assumption is correct, this upgrade should resolve the disconnection issues and you should no longer see the disconnects in your logs.

Please let us know if you continue to experience problems after this update. We are committed to ensuring the smooth operation of our service for your needs.

akseerali commented 1 year ago

Hi @volodymyr-babak,

We upgraded TB Edge to version 3.5; however, this alone did not resolve the issue of sending RPC requests from the Cloud to the Edge. After the upgrade, we re-assigned the Devices group to the Edge and it worked. I think the Edge upgrade also played its part, because I had tried the same re-assignment with Edge version 3.4.3 without success.

Regarding the disconnection/synchronization issue of edge with cloud, we'll continue to observe it for more days.

Many thanks.

akseerali commented 1 year ago

Hi @volodymyr-babak,

The NO_ACTIVE_CONNECTION error for RPC calls to the Device appeared again when we tried to send server-side RPC requests to the Edge today. The issue cleared once again after re-assigning the Devices group to the Edge.

image

volodymyr-babak commented 1 year ago

Hello @akseerali,

I appreciate your patience as we work to resolve your issue.

To aid in our troubleshooting, could you please verify whether you can observe the RPC Call event under the Downlinks tab of the Edge entity? I'm currently trying to ascertain whether the issue originates from the Edge or if it lies within the cloud's capability to send the RPC Call event to the Edge.

For further investigation, I'll be running my own Edge demo overnight in an attempt to replicate the issue locally. I'm currently hypothesizing that the problem might be associated with the device session timeout. After a certain period, the cloud may begin to send RPC requests under the assumption that the device is directly connected to the cloud and not interfacing via the Edge.

I will share my findings and any potential solutions as soon as I have more information. In the meantime, I encourage you to check for the RPC Call event, as mentioned earlier, and report any findings.

Thank you for your understanding, and I look forward to resolving this issue promptly.

akseerali commented 1 year ago

Hi @volodymyr-babak,

Thanks for the information and your efforts. I have double-checked the Downlinks tab under the Edge details, and no RPC Call event appeared while the error was occurring; events only appeared once the Devices group was re-assigned to the Edge instance. You may be right that the issue is related to the session timeout.

Please let me know in case of any findings. Many thanks

volodymyr-babak commented 1 year ago

Hi @akseerali,

I have a few clarifying questions that could help us diagnose this issue more effectively.

Firstly, do you have a single Edge entity in your system, or are there multiple ones? If there are multiple Edge entities, could you please verify if your device belongs to a group that is assigned exclusively to a single Edge entity? Additionally, it would be beneficial to ensure that this device doesn't belong to any other group that could potentially be assigned to another Edge.

These steps will help us isolate the problem more accurately. Looking forward to your response.

akseerali commented 1 year ago

Hi @volodymyr-babak,

We have only one Edge entity in our system, and the device is assigned only to this Edge. I have a few other observations regarding the error.

In our case, one Device is directly connected to the Edge. The RPC NO_ACTIVE_CONNECTION error appeared when the Device Profile of type Default was assigned to that Device. This is probably due to the session timeout.

I changed the Device Type to MQTT 2-3 days ago, and so far no RPC error has appeared. Please see the attached diagram of the system architecture. One more thing: this issue only appeared after the Cloud version was updated. I'll continue to observe it after these changes. Many thanks

image

akseerali commented 1 year ago

Hello @volodymyr-babak,

Today the postgres container is showing an error after updating and upgrading some file in the Ubuntu system. Could you please mention how to clear this issue? I have restarted the container, but the issue persists. Please see the logs below.

`PostgreSQL Database directory appears to contain a database; Skipping initialization

2023-06-21 12:43:05.832 IST [1] LOG: starting PostgreSQL 12.14 (Debian 12.14-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-06-21 12:43:05.832 IST [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-06-21 12:43:05.833 IST [1] LOG: listening on IPv6 address "::", port 5432
2023-06-21 12:43:05.834 IST [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-21 12:43:05.845 IST [27] LOG: database system shutdown was interrupted; last known up at 2023-06-21 11:35:41 IST
2023-06-21 12:43:05.939 IST [27] LOG: invalid primary checkpoint record
2023-06-21 12:43:05.939 IST [27] PANIC: could not locate a valid checkpoint record
2023-06-21 12:43:06.031 IST [1] LOG: startup process (PID 27) was terminated by signal 6: Aborted
2023-06-21 12:43:06.031 IST [1] LOG: aborting startup due to startup process failure
2023-06-21 12:43:06.032 IST [1] LOG: database system is shut down
iiotedge@OptiPlex-5060-C125:~$

`

volodymyr-babak commented 1 year ago

Hello @akseerali,

Are these the complete logs for the PostgreSQL container? If not, could you please provide the full logs for a more comprehensive overview?

Additionally, could you clarify the exact steps you've undertaken when you refer to 'updating and upgrading some file in the Ubuntu system'? Providing these details will allow for a more accurate analysis and assist in identifying the issue at hand.

Thank you.

akseerali commented 1 year ago

Hi @volodymyr-babak,

Please find attached the postgres container logs. postgres-container-logs.txt

I have noticed that an old postgres container (used for upgrading the PE Edge from 3.4 to 3.5) was somehow started. I have now stopped the container. Below are the commands used in Ubuntu system.

sudo apt-get update
sudo apt-get upgrade

akseerali commented 1 year ago

Hi @volodymyr-babak,

Please let me know if the use of backup database can fix this issue. The backup was saved during the upgrade of Edge instance.

volodymyr-babak commented 1 year ago

Hello @akseerali

According to the Postgres container logs, the checkpoint record is corrupted, and Postgres is unable to start because of this.

https://sysopspro.com/fix-postgresql-error-panic-could-not-locate-a-valid-checkpoint-record/

According to this article, you will need to log in to the postgres container and reset the transaction log by executing this command:

/usr/bin/pg_resetxlog -f /path/to/pg/data/directory

Please try this and let me know your results.
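
One caveat on the command above: the Postgres log earlier in this thread reports PostgreSQL 12.14, and `pg_resetxlog` was renamed to `pg_resetwal` in PostgreSQL 10. The sketch below only picks the tool name by major version and prints the resulting command. The helper name `reset_tool_for_major` is hypothetical, and the data directory path is still the placeholder from the original command.

```shell
#!/bin/sh
# Hypothetical helper: print the WAL-reset tool name for a PostgreSQL major version.
# pg_resetxlog was renamed to pg_resetwal in PostgreSQL 10.
reset_tool_for_major() {
    major="$1"
    if [ "$major" -ge 10 ]; then
        echo "pg_resetwal"
    else
        echo "pg_resetxlog"
    fi
}

# The log in this thread reports PostgreSQL 12.14, so the major version is 12.
tool="$(reset_tool_for_major 12)"

# Print the command instead of running it. Resetting the WAL can discard
# recent transactions, so copy the data directory somewhere safe first.
echo "$tool -f /path/to/pg/data/directory"
```

The tool has to run as the postgres user while the server is stopped, which is why a crash-looping container usually needs a temporary container that mounts the same data volume.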

akseerali commented 1 year ago

Hi @volodymyr-babak

Since the postgres container was restarting every few seconds, logging in to the container was not possible. I created a temporary container to reset the logs as per the steps in the figure below. This did not resolve the issue. image

From this topic, I have found a way to reset the Postgres database log file in a docker container. Please see the steps below. image

The above method cleared the log error, but now there are some other errors observed in the Postgres container. The Edge is also not working properly. Please see the attached Edge and Postgres container logs. edge-logs.txt postgres-logs.txt

I think there is an issue with the database. Please let me know whether I can just use the previous backup or create a new database to clear the issue. The PE Edge is newly deployed, so losing the old data is not a problem. Thanks

volodymyr-babak commented 1 year ago

Hello @akseerali

Indeed, this looks like a database issue, and some files or permissions appear to be corrupted. Could you tell me how you backed up your database before upgrading? What was the command? What does your backup look like in terms of folders, and what is inside those folders? Thanks.

akseerali commented 1 year ago

Hi @volodymyr-babak

I followed these instructions to back up the database. The command used is shown below.

sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP

Below is the screenshot of the database folder. image

volodymyr-babak commented 1 year ago

Thanks for the provided information.

In this case, you can try the following:

  1. Back up your current broken folder, just in case:

sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP-BROKEN

  2. Remove your current data folder:

sudo rm -rf ~/.mytb-edge-data/db

  3. Copy your previous backup into the data folder:

sudo cp -r ~/.mytb-edge-db-BACKUP ~/.mytb-edge-data/db

  4. Modify your docker-compose.yml and set the Edge version to the one that worked with the backup folder before the update.

  5. docker compose stop

  6. docker compose rm

  7. docker compose up -d

  8. docker compose logs

Once you have completed these steps, please let me know the results. Please be careful during these steps not to remove the working backup that is currently in place.
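
The restore steps above can be sketched as a single script. This is only an illustration under stated assumptions: the paths are the ones used in this thread, while the `DRY_RUN` switch, the `run` helper, and the `restore_edge_db` function are hypothetical additions. It also deviates from the listed order in two ways: it stops the containers before touching the files, and it passes `-f` to `docker compose rm` to skip the confirmation prompt.

```shell
#!/bin/sh
# Sketch of the restore steps. DRY_RUN, run() and restore_edge_db() are
# illustrative assumptions, not part of ThingsBoard or Docker.
DRY_RUN="${DRY_RUN:-1}"    # default: only print the commands

DATA_DIR="$HOME/.mytb-edge-data/db"
BACKUP_DIR="$HOME/.mytb-edge-db-BACKUP"
BROKEN_DIR="$HOME/.mytb-edge-db-BACKUP-BROKEN"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restore_edge_db() {
    # In a real run, refuse to continue if the known-good backup is missing.
    if [ "$DRY_RUN" != "1" ] && [ ! -d "$BACKUP_DIR" ]; then
        echo "backup not found: $BACKUP_DIR" >&2
        return 1
    fi
    run docker compose stop                     # stop before touching the files
    run sudo cp -r "$DATA_DIR" "$BROKEN_DIR"    # keep a copy of the broken folder
    run sudo rm -rf "$DATA_DIR"
    run sudo cp -r "$BACKUP_DIR" "$DATA_DIR"
    # Edit docker-compose.yml at this point to pin the Edge version that
    # worked with the backup before the update.
    run docker compose rm -f
    run docker compose up -d
    run docker compose logs
}

restore_edge_db
```

With the default `DRY_RUN=1` the script only prints the seven commands, so the sequence can be reviewed before running it for real with `DRY_RUN=0`.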

akseerali commented 1 year ago

Hi @volodymyr-babak

Thanks for the detailed steps. Restoring the backup database solves the error; however, when I upgrade the Edge from 3.4.3EDGEPE to 3.5.0EDGEPE or 3.5.1EDGEPE, the upgrade process shows an error. Please find attached the Edge container logs from my attempt to upgrade from 3.4.3EDGEPE to 3.5.0EDGEPE. tbedge upgrade logs.txt

With 3.4.3EDGEPE version, the instance is running like pre-upgrade time. I think the only stable way now is to use a new database and use the latest EDGEPE version.

volodymyr-babak commented 1 year ago

Hello @akseerali,

Based on the logs, it seems the system is not upgrading from version 3.4.3 to 3.5.0 as expected. Could you please check the contents of the following file in the edge container: /data/.upgradeversion

If it's not set to 3.4.3, please adjust it to reflect 3.4.3 and initiate the upgrade procedure following the steps provided here: https://thingsboard.io/docs/user-guide/install/pe/edge/upgrade-instructions/#docker-linux-mac-35
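
A minimal sketch of that check, assuming the file holds nothing but the version string (this thread later describes changing 3.5.0 to 3.4.3 in it). The function name `check_upgrade_version` is hypothetical; run it inside the Edge container or against a copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: compare the version recorded in an .upgradeversion file
# with the version you expect to upgrade from.
check_upgrade_version() {
    file="$1"
    expected="$2"
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    # Strip whitespace and newlines so a trailing newline does not cause a mismatch.
    actual="$(tr -d '[:space:]' < "$file")"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $actual"
    else
        echo "mismatch: found $actual, expected $expected"
        return 1
    fi
}

# Example (inside the Edge container):
#   check_upgrade_version /data/.upgradeversion 3.4.3
```

A non-zero return code means the file is missing or records a different version, in which case the file should be corrected before starting the upgrade.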

akseerali commented 1 year ago

Hi @volodymyr-babak

Thanks. After changing 3.5.0 to 3.4.3 in the /data/.upgradeversion file inside the Edge container, the Edge finally upgraded to the new version. The new version also solves the Edge connectivity problem, so I am closing this issue.

Thanks again

akseerali commented 1 year ago

Hi @volodymyr-babak

A TB Edge PE synchronization issue is observed on 01/07/2023.

image

image

Question: How to avoid this kind of issue in a production environment in future? Please find attached the edge container logs. Thanks

Edge version 3.5.1EDGEPE tb-edge.log

akseerali commented 1 year ago

Please note that, after some time, TB Cloud is again showing only one Device as active despite receiving telemetry data from the other Devices through the Edge instance.

volodymyr-babak commented 1 year ago

Hey @akseerali ,

I noticed errors in the logs that could be a major communication bug in the most recent release:

2023-07-01 17:00:08,479 [cloud-manager-71-thread-1] INFO  o.t.s.s.cloud.CloudManagerService - Resetting seqIdOffset - new cycle started
2023-07-01 17:00:08,482 [cloud-manager-71-thread-1] WARN  o.t.s.s.cloud.CloudManagerService - Failed to process messages handling!
java.lang.IndexOutOfBoundsException: Index: -1
    at java.base/java.util.Collections$EmptyList.get(Collections.java:4483)

and

2023-07-01 16:58:46,293 [grpc-default-executor-66] ERROR o.t.s.s.cloud.CloudManagerService - [dd8f4df7-bcbe-b548-70f5-bc2b400fb8a4] Msg processing failed! Error msg: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition.

I plan to investigate these issues and prepare a hotfix for the 3.5.1 release. I'll update this ticket as soon as the hotfix is ready. My goal is to have the hotfix released by tomorrow.

volodymyr-babak commented 1 year ago

The hotfix for the Community Edition, CE 3.5.1.1, has been completed and released. You can find it at this link: https://github.com/thingsboard/thingsboard-edge/releases/tag/v3.5.1.1

The specific commit that addresses the IndexOutOfBoundsException issue can be found here: https://github.com/thingsboard/thingsboard-edge/commit/db947eccd63e4b9d498d37c213c95d9f73a2124c

The Professional Edition hotfix, PE 3.5.1.1, is on its way and will be available soon.

volodymyr-babak commented 1 year ago

The Professional Edition (PE 3.5.1.1) hotfix has also been released. Please follow the upgrade guide to update your ThingsBoard Edge instances. It's worth noting that this update doesn't require a database update, only a package update, so it should be a quick process.

For the Community Edition (CE) upgrade instructions, follow this link: https://thingsboard.io/docs/user-guide/install/edge/upgrade-instructions/#upgrading-to-3511

For the Professional Edition (PE) upgrade instructions, refer to this link: https://thingsboard.io/docs/user-guide/install/pe/edge/upgrade-instructions/#upgrading-to-3511

Should you encounter any issues after the update, please don't hesitate to inform me. I apologize for any inconveniences this bug may have caused.

akseerali commented 1 year ago

Hi @volodymyr-babak

We have updated the edge version to 3.5.1.1 on 7th July 2023 and found that the cloud is again having a synchronization issue with edge instance on 8th July 2023. From Cloud, the Edge downlinks section was not sending any updates including the RPC requests (please see the attached figure). image

To clear this issue, I tried to restart the docker compose; however, it didn't solve the issue and I had to re-assign the device group to the edge instance.

It should be noted that the cloud was again able to receive the telemetry data and the Edge status was showing active at both edge and cloud.

From the edge container, I have found below error. I have also attached complete logs. Please fix the synchronization issue between edge and cloud.

tb-edge | 2023-07-08 12:09:13,430 [grpc-default-executor-10] WARN o.t.edge.rpc.EdgeGrpcClient - [dd8f4df7-bcbe-b548-70f5-bc2b400fb8a4] Stream was terminated due to error:
tb-edge | io.grpc.StatusRuntimeException: CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
tb-edge | at io.grpc.Status.asRuntimeException(Status.java:535)
tb-edge | at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:479)
tb-edge | at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
tb-edge | at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
tb-edge | at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
tb-edge | at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
tb-edge | at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
tb-edge | at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
tb-edge | at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
tb-edge | at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
tb-edge | at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
tb-edge | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
tb-edge | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
tb-edge | at java.base/java.lang.Thread.run(Thread.java:829)
tb-edge | 2023-07-08 12:09:14,276 [sql-log-1-thread-1] INFO o.t.s.dao.sql.TbSqlBlockingQueue - Queue-0 [Events] queueSize [0] totalAdded [67] totalSaved [67] totalFailed [0]
tb-edge | 2023-07-08 12:09:14,276 [sql-log-1-thread-1] INFO o.t.s.dao.sql.TbSqlBlockingQueue - Queue-1 [Events] queueSize [0] totalAdded [18] totalSaved [18] totalFailed [0]
tb-edge | 2023-07-08 12:09:14,276 [sql-log-1-thread-1] INFO o.t.s.dao.sql.TbSqlBlockingQueue - Queue-2 [Events] queueSize [0] totalAdded [18] totalSaved [18] totalFailed [0]
tb-edge | 2023-07-08 12:09:14,363 [sql-log-1-thread-1] INFO o.t.s.dao.sql.TbSqlBlockingQueue - Queue-0 [Attributes] queueSize [0] totalAdded [10] totalSaved [10] totalFailed [0]
tb-edge | 2023-07-08 12:09:14,363 [sql-log-1-thread-1] INFO o.t.s.dao.sql.TbSqlBlockingQueue - Queue-1 [Attributes] queueSize [0] totalAdded [11] totalSaved [11] totalFailed [0]
tb-edge | 2023-07-08 12:09:16,437 [cloud-manager-reconnect-72-thread-1] INFO o.t.s.s.cloud.CloudManagerService - Trying to reconnect due to the error: io.grpc.StatusRuntimeException: CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL!
tb-edge | 2023-07-08 12:09:16,444 [cloud-manager-reconnect-72-thread-1] INFO o.t.edge.rpc.EdgeGrpcClient - [dd8f4df7-bcbe-b548-70f5-bc2b400fb8a4] Sending a connect request to the TB!

[edge-logs-v3.5.1.1.txt](https://github.com/thingsboard/thingsboard-edge/files/11992440/edge-logs-v3.5.1.1.txt)
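
To see how often these disconnects occur over a longer period, the log file can be summarized with a small helper. This is a sketch only: the function name `summarize_disconnects` is hypothetical, and the two matched strings are copied from the log excerpt above, so they may differ in other Edge versions.

```shell
#!/bin/sh
# Hypothetical helper: count gRPC stream terminations and reconnect attempts
# in an Edge log file. The search strings come from the excerpt in this thread.
summarize_disconnects() {
    log="$1"
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the script going under "set -e".
    terminated="$(grep -c 'Stream was terminated due to error' "$log" || true)"
    reconnects="$(grep -c 'Trying to reconnect due to the error' "$log" || true)"
    echo "stream terminations: $terminated"
    echo "reconnect attempts: $reconnects"
}

# Example:
#   summarize_disconnects edge-logs-v3.5.1.1.txt
```

Comparing the two counts before and after an upgrade gives a rough indication of whether the connection has become more stable.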

volodymyr-babak commented 1 year ago

Hi @akseerali,

I think it would be beneficial for us to set up a short call to troubleshoot this situation, as I am currently unable to clearly understand the steps needed to reproduce the issue. I've been running a personal PE Edge for a month now, and have successfully been able to send RPC requests to the device every single day. It seems like I might be missing a step to reproduce this correctly.

Could you kindly send me an email to the address mentioned in my profile? We can coordinate the details of our call via email.

Thank you in advance for your cooperation.

akseerali commented 1 year ago

Hi @volodymyr-babak

I have been testing RPC requests from the Cloud to the Device connected to TB PE Edge, and they work fine with the setup below.

  • Edge PE version 3.5.1.1
  • Created a Device from the Edge instance. This created a device group whose name starts with “[Edge]”.
  • This resolved the “NO_ACTIVE_CONNECTION” issue that had been appearing every few days.

I observed that this problem arose specifically when employing RPC requests with a Device that originated from the Cloud and was allocated to any Device group other than an Edge Device group beginning with "[Edge]".

Thank you so much for the support.

volodymyr-babak commented 1 year ago

@akseerali

Thank you for the updates. I am reopening the ticket to re-examine this theory, specifically looking at multiple device groups other than those that begin with "[Edge]."

AndreMaz commented 1 year ago

Not sure if fully related but I also see RPC Error: NO_ACTIVE_CONNECTION when making an RPC request via TB-Cloud (making the same RPC request to TB-Edge works just fine.)

The strange part is that TB-Cloud and TB-Edge connection is ok as they successfully exchange ping reqs.

image

Also, the device seems active both at TB-Cloud and TB-Edge image

More context:

volodymyr-babak commented 1 year ago

@akseerali @AndreMaz

Thank you for all the input. I believe I've finally identified the root cause of the issue. In cases where a device is not created over the edge but is created on the cloud and then assigned to the edge, a specific "ManagedByEdge" relation from the device to the edge is not created automatically. However, this relation is essential in the DeviceActor to find the related edge and send RPC commands to it.

As a temporary fix, please add the following relation from the device to the required edge:

2023-09-08_17-44

Please let me know if this update resolves your issues. In the meantime, I will consider ways to improve this approach to eliminate the need for manually adding this relation while still achieving the expected functionality.

bcblr1993 commented 1 month ago

@akseerali I am currently using ThingsBoard CE version 3.4.3 and encountering the same issue as you. May I ask if you were able to resolve this issue eventually? Additionally, it is difficult for me to reproduce the problem as it only occurs after running for a while. I look forward to your response.

akseerali commented 1 month ago

@bcblr1993

We resolved this issue by following the instructions in the last comment. See the details below.

Thank you for all the input. I believe I've finally identified the root cause of the issue. In cases where a device is not created over the edge but is created on the cloud and then assigned to the edge, a specific "ManagedByEdge" relation from the device to the edge is not created automatically. However, this relation is essential in the DeviceActor to find the related edge and send RPC commands to it.

As a temporary fix, please add the following relation from the device to the required edge:

  • Direction: Must be set to 'TO'
  • Type: Must be set to 'ManagedByEdge'

2023-09-08_17-44

Please let me know if this update resolves your issues. In the meantime, I will consider ways to improve this approach to eliminate the need for manually adding this relation while still achieving the expected functionality.

bcblr1993 commented 1 month ago

@akseerali Thank you for your reply. I’ll give it a try.

bcblr1993 commented 1 month ago

@akseerali Also, I would like to ask if this issue has been resolved in version 3.5.1 that you are using, as I am using the CE version?

My current issue is that, after a while, the edge device continues to send telemetry data normally, but the cloud shows the edge device’s active attribute as false, and I am unable to send commands or synchronize. I am currently using version 3.4.3 CE.