thingsboard / thingsboard

Open-source IoT Platform - Device management, data collection, processing and visualization.
https://thingsboard.io
Apache License 2.0
17.57k stars 5.2k forks

[Bug] CoAP observation drops #5263

Open jairohg opened 3 years ago

jairohg commented 3 years ago

Bug: The CoAP observation relationship is "lost" ~1 min after being established, while trying to observe coap://host/api/v1/$ACCESS_TOKEN/rpc.

Server: confirmed on Your Device

To Reproduce

Steps to reproduce the behavior:

  1. Place the sample "send rpc" widget on a dashboard. Subscribe a coap-client process by launching it with the following syntax: ./coap-client -m get coap://demo.thingsboard.io/api/v1/$ACCESS_TOKEN/rpc -s 720 -B 720
  2. Click "send rpc", which should result in the following output being printed (stdout) by the coap-client process: {"id":1,"method":"rpcCommand","params":{}}
  3. Wait ~1 min or longer and click "send rpc" again. This command never reaches the coap-client.

Expected behavior

The CoAP observation should not be dropped without notifying the observer.

Relevant logs

2021-09-21 13:37:38,274 [DefaultTransportService-22-6] INFO o.e.californium.core.CoapResource - successfully established observe relation between 172.19.0.1:36342#BEEFFEED and resource /api/v1 (Exchange[R1132], size 33)

After failing to send an RPC to the device:

2021-09-21 13:45:06,488 [CoapServer(main)#2] INFO o.e.c.c.network.stack.ObserveLayer - notification for token [Token=BEEFFEED] timed out. Canceling all relations with source [/172.19.0.1:36342]
2021-09-21 13:45:06,489 [CoapServer(main)#2] INFO o.e.californium.core.CoapResource - remove observe relation between 172.19.0.1:36342#BEEFFEED and resource /api/v1 (Exchange[R1132, complete], size 32)
2021-09-21 13:45:06,489 [CoapServer(main)#2] ERROR o.e.c.c.n.stack.ReliabilityLayer - Exception for Exchange[R1132, complete] in MessageObserver: null
java.lang.NullPointerException: null
        at org.thingsboard.server.transport.coap.client.DefaultCoapClientContext.cancelRpcSubscription(DefaultCoapClientContext.java:741)
        at org.thingsboard.server.transport.coap.client.DefaultCoapClientContext.deregisterObserveRelation(DefaultCoapClientContext.java:176)
        at org.thingsboard.server.transport.coap.CoapTransportResource$CoapResourceObserver.removedObserveRelation(CoapTransportResource.java:504)
        at org.eclipse.californium.core.CoapResource.removeObserveRelation(CoapResource.java:778)
        at org.eclipse.californium.core.observe.ObserveRelation.cancel(ObserveRelation.java:151)
        at org.eclipse.californium.core.observe.ObservingEndpoint.cancelAll(ObservingEndpoint.java:74)
        at org.eclipse.californium.core.observe.ObserveRelation.cancelAll(ObserveRelation.java:162)
        at org.eclipse.californium.core.network.stack.ObserveLayer$NotificationController.onTimeout(ObserveLayer.java:233)
        at org.eclipse.californium.core.coap.Message.setTimedOut(Message.java:954)
        at org.eclipse.californium.core.network.Exchange.setTimedOut(Exchange.java:707)
        at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask.retry(ReliabilityLayer.java:524)
        at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask.access$200(ReliabilityLayer.java:430)
        at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask$1.run(ReliabilityLayer.java:467)
        at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:289)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

Additional context

Somewhat noteworthy: monitoring packets in and out, we could not observe any outgoing packets for the failed RPC to the device. Additionally, this interaction is over IPv6, yet the log shows the observation being mapped to an IPv4 address.

jairohg commented 3 years ago

For anyone interested, this issue appears to be related to connection tracking in the NAT performed by the Docker network: the CoAP packet is sent by the application and then lost in the Docker network because the connection tracker on the host released the mapping for the client. Currently looking into other microservice deployment options using Kubernetes. Help/pointers from the developers would be appreciated.

ashvayka commented 3 years ago

Hi @jairohg , I will provide more comments about this issue tomorrow morning. This is indeed related to NAT but not only to NAT. It is also about the routing tables on many load balancers. It is a long story but we have a solution. Stay tuned for updates.

ashvayka commented 3 years ago

We have faced a similar issue while testing some NB IoT devices that communicate over CoAP and LwM2M. Even when the device is not behind NAT, the issue with the load balancer implementation is present.

AWS NLB

During our experiments, we found that the default AWS NLB implementation "remembers" the route between the client ip+port and the K8s instance for 2 minutes. After ~2-3 minutes of inactivity, any downlink we send to the client is not delivered. We have not found any configuration property for the lifetime of the session that works for our case. In an attempt to solve the problem, we decided to try the Nginx LB, which has many configuration options.

Nginx LB

We allocated an Nginx LB with a static IP address and forwarded traffic to our K8s services in the same VPC, exposed as NodePorts. This solution allows you to configure the "session" expiration time (route table entry lifetime) to 24 hours or more. It works well until your target servers (ThingsBoard CoAP Transport) are restarted due to an upgrade or outage. In such a case, the load balancer should forward the traffic for existing "sessions" to the new server. Unfortunately, this was not the case: traffic was forwarded for new clients, but not old ones. Old clients' traffic was forwarded to the wrong server and was never delivered. Theoretically, it is possible to configure Nginx to close the session using the proxy_responses setting and to close the connections using proxy_session_drop (which is part of the commercial subscription, by the way). But for our use case, these settings are not a good fit, for multiple reasons:

  1. Clients may want to use non-confirmable packets for telemetry to optimize traffic consumption. In such a case, the proxy_responses setting does not work for us: the load balancer simply does not know whether it should close the session or not.
  2. The amount of traffic should not increase due to server failures or restarts.

That is why the only viable solution we found is to create our own UDP load balancer. It may not be as fast as Nginx or its competitors, but the amount of traffic from NB IoT devices is relatively low, and we need more control over "session" management.
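For readers trying to reproduce the Nginx experiment: a minimal stream (UDP) pass-through of the kind described above might look like the sketch below. The addresses, ports, and timeout values are purely illustrative, not the actual setup from this thread.

```nginx
stream {
    upstream coap_transport {
        # K8s nodes exposing the ThingsBoard CoAP transport as a NodePort
        # (illustrative addresses)
        server 10.0.0.11:30683;
        server 10.0.0.12:30683;
    }

    server {
        listen 5683 udp;
        proxy_pass coap_transport;
        # Keep the UDP "session" alive long enough for sparse NB IoT traffic
        proxy_timeout 24h;
        # Number of datagrams expected from the upstream per client packet;
        # as discussed above, this does not help for NON-only clients
        proxy_responses 1;
    }
}
```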

ThingsBoard UDP LB

A simple LB implementation which is open-source and powered by great frameworks. It has certain limitations but works best for our use case. A single server may handle up to 50K concurrent UDP sessions (since each session occupies one client port on the load balancer). The server reacts to DNS address changes, updates the routing table, and forwards the data to the new server(s). The LwM2M transport remembers all active sessions, so the clients are not affected. What we plan to do next:

  1. Cache the routing table in Redis or local files, so that a restart of the load balancer will not affect our setup.
  2. Cache the subscription info on the ThingsBoard CoAP Transport, same as for LwM2M, to remember the RPC/Attribute subscriptions for the client. This way we optimize the traffic, and the devices may be super constrained without any complex logic around reconnects, etc.
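To make the "routing table" idea concrete, here is a minimal, hypothetical sketch of the session map such a UDP LB has to maintain (plain Java stdlib; the class and method names are mine, not the actual thingsboard-udp-loadbalancer code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a UDP LB session table: maps a client "ip:port"
 * to the upstream node it was first routed to, with a TTL so stale
 * entries expire instead of pinning clients to dead servers forever.
 */
public class UdpSessionTable {
    static final class Session {
        final String upstream;     // "nodeIp:nodePort" the client is pinned to
        volatile long lastSeenMs;  // refreshed on every packet in either direction
        Session(String upstream, long now) { this.upstream = upstream; this.lastSeenMs = now; }
    }

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();
    private final long ttlMs;

    public UdpSessionTable(long ttlMs) { this.ttlMs = ttlMs; }

    /**
     * Returns the upstream for this client, creating a new pinned session
     * if none exists or the old one has expired.
     */
    public String route(String client, String chosenUpstream, long nowMs) {
        Session s = sessions.get(client);
        if (s == null || nowMs - s.lastSeenMs > ttlMs) {
            s = new Session(chosenUpstream, nowMs);
            sessions.put(client, s);
        }
        s.lastSeenMs = nowMs; // keep-alive: any traffic refreshes the session
        return s.upstream;
    }
}
```

The key property is stickiness: within the TTL a client always maps to the same upstream, which is exactly what a CoAP observe relation needs to survive.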

WillNilges commented 3 years ago

The ThingsBoard UDP LB is only 8 days old. Is it ready for production use?

ashvayka commented 3 years ago

We are constantly running tests on it. So far, so good. Of course, it is super young, but it solves our main problem at the moment. There might be better solutions out there, and don't get me wrong, we definitely don't plan to compete (yet :-) ) with other load balancers on the market. The AWS LB and Nginx are much better for most use cases. This is just about fixing the specific issue with non-confirmable packets and session TTL for our case.

Besides, we would be very happy if we could avoid writing and supporting our own load balancer at this moment.

WillNilges commented 3 years ago

Makes sense. Do you have advice on how to use it with a bare-metal K8s deployment? I've been working on getting MQTT and CoAP working with Ingress-Nginx, with limited success. My plan this week is to see if I can get something working with MetalLB instead, and then use Ingress-Nginx to route traffic from there.

jairohg commented 3 years ago

Hi @ashvayka, thanks for putting this together. Would you mind sharing details of your test setup for the UDP load balancer? These could serve as sample configs that one could try locally to provide another data point.

WillNilges commented 2 years ago

I want to add some instructions on how to build and run the loadbalancer. I got the image built and pushed, but I don't know how to deploy it.

https://github.com/thingsboard/thingsboard-udp-loadbalancer/pull/6

WillNilges commented 2 years ago

I suppose it's worth mentioning the setup where I see my issues. I'm running an RKE v1.21.5 cluster on Fedora Server 34, virtualized on self-hosted Proxmox, using a modified version of the minikube scripts located here. With the Calico networking addon, and the ThingsBoard CoAP transport patched through a NodePort, I'm seeing the same behavior.

@ashvayka, do you think you can share any documentation or test results you have from this project? As far as we know, we've seen this issue in all environments and all deployments. I'd love to work with you on getting this LB working on Kube, but I'm confused as to where exactly this LB is supposed to run. What exactly are the issues posed by running it containerized? Is it supposed to be installed on a services VM, like a typical HAProxy deployment?

It sounds like it's supposed to be installed on a host system and then run as a daemon or something, but the presence of Dockerfiles and the implication that this can run in conjunction with a microservices deployment is confusing me.

WillNilges commented 2 years ago

I have an update: @jairohg and I tried using CoAP with and without DTLS on a K8s Thingsboard deployment (same specs as above: RKE v1.21.5 cluster on Fedora Server 34, virtualized on self-hosted Proxmox) using MetalLB. The agent we're using connected OK, but RPC still dies about 2 minutes in. So MetalLB also doesn't seem to fit the bill.

I don't quite understand why we're seeing this issue even when a bare-metal deployment uses a NodePort to pass traffic to the CoAP transport. Are we sure it's Docker? How exactly does the ThingsBoard UDP LB address the issue if it's not required to run in a Docker container, but instead runs on the host as a .deb package?

ashvayka commented 2 years ago

Hi @WillNilges, my 2 cents about our "coap.thingsboard.cloud" setup: at the moment, the LB is installed on AWS Ubuntu VMs with elastic IPs. It forwards the traffic to LwM2M pods using a NodePort. The LB "remembers" a routing table whose entries consist of: A) the source IP and port of the device; B) the destination IP and port of the node. The LB is configured to remember sessions for 1 hour. So, when the node has an update, we make sure we push it from the correct LB IP and port, and not from the AWS NAT Gateway.

Before the LB, we were still publishing the update from the node, but it was sent from the wrong IP (not from the LB IP, which received the packet, but from the AWS NAT Gateway). The client was ignoring the update since the IP of the originator of the CoAP/UDP packet was different.
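The "reply from the IP the client talked to" constraint boils down to sending downlinks through the same socket that received the uplink. A minimal stdlib sketch of that pattern (illustrative only, not code from the actual LB):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

/**
 * Illustrative pattern: replies must leave from the same local ip:port the
 * client sent to, or the client's UDP/CoAP stack will ignore them as coming
 * from an unknown peer.
 */
public class SameSourceReply {
    public static void serveOnce(DatagramSocket socket) throws Exception {
        byte[] buf = new byte[1500];
        DatagramPacket request = new DatagramPacket(buf, buf.length);
        socket.receive(request); // blocks for one datagram
        byte[] reply = "ack".getBytes();
        // Reusing the SAME socket pins the reply's source address/port to the
        // one the client originally targeted. Sending via a different socket
        // (or letting a NAT gateway rewrite the source, as described above)
        // changes the source, and the client drops the packet.
        socket.send(new DatagramPacket(reply, reply.length,
                request.getAddress(), request.getPort()));
    }
}
```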

WillNilges commented 2 years ago

Got an update: I actually managed to pass traffic from the LB to my cluster using this. I have my cluster on one Fedora VM, and my LB is on a separate Ubuntu VM. However, I don't think this is the correct configuration, because while we can pass traffic, we're still seeing this CoAP observation dropping issue. Do I need to... somehow... install the LB on the cluster and point it at the CoAP service ClusterIP?

galenguyer commented 2 years ago

Hey, I want to raise this issue again. I've taken over from @WillNilges and am still seeing the issue with traffic dropping, using release 1.0.0 of the tb-udp-lb.

thedevleon commented 1 year ago

Hi, I realize this is a relatively old issue, however, I'm seeing the exact same behavior on a monolithic instance of Thingsboard CE v3.4.4-1.

In the first minute, RPCs work fine; however, at some point after that, a packet seems to be dropped (nothing arrives at my board), which triggers the timeout and the deregistration of the observer.

2023-03-14 14:41:27,518 [http-nio-0.0.0.0-8080-exec-5] INFO  o.t.s.c.plugin.TbWebSocketHandler - [8ba88830-b163-11ed-9887-a7083c98d4dd][0f3340a2-14a9-4c5b-ae37-2918d9ec0de3][29a6928d-5e6a-19a5-84c2-bd7978deed54] Session is opened from address: /46.244.106.52:65430
2023-03-14 14:42:23,031 [DefaultTransportService-23-15] INFO  o.e.californium.core.CoapResource - successfully established observe relation between 46.244.106.52:3569#BBA240B1EBE64278 and resource /api/v1 (Exchange[R430746], size 202)
2023-03-14 14:42:33,037 [device-dispatcher-0-4] INFO  o.t.s.a.d.DeviceActorMessageProcessor - [85665b10-b9aa-11ed-9887-a7083c98d4dd][5a999224-a56e-43c7-92fb-98d408e91fa1] Rpc has already removed from pending map.
2023-03-14 14:42:57,456 [device-dispatcher-1-1] INFO  o.t.s.a.d.DeviceActorMessageProcessor - [85665b10-b9aa-11ed-9887-a7083c98d4dd][9ec31dba-63c5-42c6-8f99-6d04a1470a4c] Rpc has already removed from pending map.
2023-03-14 14:43:12,629 [device-dispatcher-2-2] INFO  o.t.s.a.d.DeviceActorMessageProcessor - [85665b10-b9aa-11ed-9887-a7083c98d4dd][8a77d277-de6e-4dc6-a9fa-661a37397fc0] Rpc has already removed from pending map.
2023-03-14 14:43:26,835 [CoapServer(main)#1] INFO  o.e.c.c.network.stack.ObserveLayer - notification for token [Token=BBA240B1EBE64278] timed out. Canceling all relations with source [46.244.106.52:3569]
2023-03-14 14:43:26,835 [CoapServer(main)#1] INFO  o.e.californium.core.CoapResource - remove observe relation between 46.244.106.52:3569#BBA240B1EBE64278 and resource /api/v1 (Exchange[R430746, complete], size 201)
2023-03-14 14:43:26,837 [CoapServer(main)#1] WARN  o.e.c.c.network.stack.BaseCoapStack - error send response CON-2.02   MID=   -1, Token=BBA240B1EBE64278, OptionSet={}, <empty data>
org.eclipse.californium.core.network.ExchangeCompleteException: Exchange[R430746, complete] is already complete! CON-2.02   MID=   -1, Token=BBA240B1EBE64278, OptionSet={}, <empty data>
        at org.eclipse.californium.core.network.Exchange.assertIncomplete(Exchange.java:1314)
        at org.eclipse.californium.core.network.InMemoryMessageExchangeStore.registerWithMessageId(InMemoryMessageExchangeStore.java:240)
        at org.eclipse.californium.core.network.InMemoryMessageExchangeStore.registerOutboundResponse(InMemoryMessageExchangeStore.java:411)
        at org.eclipse.californium.core.network.UdpMatcher.sendResponse(UdpMatcher.java:161)
        at org.eclipse.californium.core.network.CoapEndpoint$OutboxImpl.sendResponse(CoapEndpoint.java:879)
        at org.eclipse.californium.core.network.stack.BaseCoapStack$StackBottomAdapter.sendResponse(BaseCoapStack.java:239)
        at org.eclipse.californium.core.network.stack.ReliabilityLayer.sendResponse(ReliabilityLayer.java:150)
        at org.eclipse.californium.core.network.stack.BlockwiseLayer.sendResponse(BlockwiseLayer.java:688)
        at org.eclipse.californium.core.network.stack.ObserveLayer.sendResponse(ObserveLayer.java:125)
        at org.eclipse.californium.core.network.stack.AbstractLayer.sendResponse(AbstractLayer.java:74)
        at org.eclipse.californium.core.network.stack.ExchangeCleanupLayer.sendResponse(ExchangeCleanupLayer.java:95)
        at org.eclipse.californium.core.network.stack.BaseCoapStack$StackTopAdapter.sendResponse(BaseCoapStack.java:197)
        at org.eclipse.californium.core.network.stack.BaseCoapStack.sendResponse(BaseCoapStack.java:113)
        at org.eclipse.californium.core.network.CoapEndpoint$4.run(CoapEndpoint.java:684)
        at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:290)
        at org.eclipse.californium.core.network.CoapEndpoint$6.run(CoapEndpoint.java:1311)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Is there maybe a way to disable the instant de-registering after a single timeout?

thedevleon commented 1 year ago

Quick update from my side, the issue was due to a missing ACK to the RPC CON message from the server, which ultimately caused the timeout.

Still, I wonder what this behavior means when a notification from the observation doesn't reach the client (e.g. because a packet has been dropped): immediately removing the observation seems a bit drastic.
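For context on the timing: a confirmable (CON) notification follows the RFC 7252 retransmission scheme, with default ACK_TIMEOUT = 2 s, binary exponential back-off, and MAX_RETRANSMIT = 4 (plus a random factor of up to 1.5, which I omit here). Only after all retries fail is the message marked timed out, at which point the observe relation is cancelled, as the logs above show. A back-of-the-envelope sketch:

```java
/**
 * Back-of-the-envelope RFC 7252 CON retransmission schedule (ignoring the
 * ACK_RANDOM_FACTOR). Defaults: ACK_TIMEOUT = 2 s, binary exponential
 * back-off, MAX_RETRANSMIT = 4.
 */
public class CoapRetransmitSchedule {
    public static long giveUpSeconds(long ackTimeout, int maxRetransmit) {
        long total = 0, timeout = ackTimeout;
        // initial transmission + maxRetransmit retries, doubling each wait
        for (int i = 0; i <= maxRetransmit; i++) {
            total += timeout;
            timeout *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        // 2 + 4 + 8 + 16 + 32 = 62 s before the exchange is marked timed out
        System.out.println(giveUpSeconds(2, 4));
    }
}
```

With the random factor, this stretches toward the RFC's MAX_TRANSMIT_WAIT of 93 s, which lines up with the roughly-two-minute failures reported in this thread.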

Where could I make an adjustment so that it takes a larger number of unacknowledged messages (for example, 5 successive messages) before the observe relation is removed?

Edit:

From further digging, it seems the configuration flags NOTIFICATION_CHECK_INTERVAL and NOTIFICATION_CHECK_INTERVAL_COUNT from Californium might be relevant.

https://github.com/thingsboard/thingsboard/blob/1be933622db1157f79a2d790e5547984344c7387/common/coap-server/src/main/java/org/thingsboard/server/coapserver/DefaultCoapServerService.java#L97
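For anyone experimenting with these flags: Californium reads its network configuration from a Californium.properties file (or the equivalent programmatic config). A hypothetical fragment using the key names mentioned above; the exact names, units, and defaults depend on the Californium version, so verify against the NetworkConfig defaults of your release before relying on this:

```properties
# Interval after which a notification is sent as CON (instead of NON) to
# re-verify that the observer is still reachable; illustrative value of
# 24 h in milliseconds.
NOTIFICATION_CHECK_INTERVAL=86400000
# Alternatively, send a CON notification every Nth notification.
NOTIFICATION_CHECK_INTERVAL_COUNT=100
```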