redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Streams Mode Stops Processing #479

Closed louisburton closed 4 years ago

louisburton commented 4 years ago

Using v3.22.0 in Streams mode. I have 4 basic streams: essentially 4 separate MQTT client inputs that perform some basic transformations and output to their respective Kafka topics. Buffers are not currently used, as I don't want to risk message loss; I prefer to rely on MQTT's message delivery guarantees and pass straight through to Kafka.

Unfortunately, if I fire a constant stream of messages at one of these streams, eventually the client for that stream disconnects and none of the streams appear to process anything. This is only resolved by bouncing the benthos process.

I have reproduced this with the debug endpoints enabled, and have tried to pull out some pprof information. Unfortunately, only the heap and stack trace profiles seemed to show anything interesting. For the block profile to work, I think runtime.SetBlockProfileRate needs to be called. The CPU profiles also showed 0 samples no matter the duration, and I'm not sure why, but I am a pprof novice.
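For context, in a plain Go program the block profile only reports data once runtime.SetBlockProfileRate has been called before the /debug/pprof endpoints are scraped. A minimal sketch of that standard setup (generic Go, not Benthos's own debug endpoint wiring; the listen address is arbitrary):

    // Minimal sketch: enable block (and mutex) profiling so that the
    // net/http/pprof endpoints have data to report.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
        "runtime"
    )

    func main() {
        // A rate of 1 records every blocking event; larger values sample.
        runtime.SetBlockProfileRate(1)
        // Optional: also record mutex contention.
        runtime.SetMutexProfileFraction(1)

        log.Println(http.ListenAndServe("localhost:6060", nil))
    }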

We deploy Benthos in a container and use a HEALTHCHECK against /ping so we can restart it if it becomes unresponsive. Unfortunately the ping result says the service is okay, and looking at the streams says they're all okay too:

{"connections_stream":{"active":true,"uptime":39030.344002114,"uptime_str":"10h50m30.344003157s"},"logs_stream":{"active":true,"uptime":39030.337605406,"uptime_str":"10h50m30.337606201s"},"registry_stream":{"active":true,"uptime":39030.335008261,"uptime_str":"10h50m30.335009044s"},"services_stream":{"active":true,"uptime":39030.332128346,"uptime_str":"10h50m30.332129133s"}}

I know the /ready endpoint would normally check whether the input is connected, and since I can see from the MQTT broker dashboard that the logs_stream's MQTT client is no longer connected, I'd hope this would flag as not ready. If I then hooked that into a HEALTHCHECK, I'd be confident that if this situation did arise the container would restart and recover. Unfortunately, this ticket means I can't leverage that: https://github.com/Jeffail/benthos/issues/477

I have attached what I have obtained from pprof, along with my configs. Any advice would be greatly appreciated. If a configuration change would alleviate this, or if there's something else we could do to mitigate it, we could try it. The metrics are great for alerting on this scenario, but we're trying to use this as a critical piece of connecting infrastructure, so we're a bit nervous about the stalling.

Thanks again for all your efforts and support! 😄

configs.zip pprofAfterStartup.zip pprofAfterProcessingStop.zip

louisburton commented 4 years ago

For clarity, here's a stack trace where the process still appears stalled, but this time running with only one stream in streams mode. hungStackSingleStream.txt

Jeffail commented 4 years ago

Thanks @louisburton, that appears to reaffirm my suspicions. My guess is that something within the MQTT client library is causing SetOnConnectHandler not to be called under certain reconnect conditions. Performing manual reconnections guarantees we re-establish all of our subscriptions, so we might get lucky and the problem goes away.
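For illustration, the re-subscribe-on-connect pattern being discussed looks roughly like this with the Paho Go client (github.com/eclipse/paho.mqtt.golang); the broker URL, client ID and topic here are placeholders rather than Benthos's actual input code:

    // Sketch of the pattern under discussion: subscriptions are re-established
    // from the on-connect callback, so a successful reconnect also restores them.
    // If that callback never fires after a reconnect, the client sits connected
    // but subscribed to nothing, which is the suspected failure mode here.
    package main

    import (
        "fmt"
        "log"

        mqtt "github.com/eclipse/paho.mqtt.golang"
    )

    func main() {
        opts := mqtt.NewClientOptions().
            AddBroker("tcp://localhost:1883"). // placeholder broker URL
            SetClientID("example_client").     // placeholder client ID
            SetCleanSession(false).
            SetAutoReconnect(true)

        opts.SetOnConnectHandler(func(c mqtt.Client) {
            if tok := c.Subscribe("foo", 1, func(_ mqtt.Client, m mqtt.Message) {
                fmt.Printf("received: %s\n", m.Payload())
            }); tok.Wait() && tok.Error() != nil {
                log.Printf("subscribe failed: %v", tok.Error())
            }
        })
        opts.SetConnectionLostHandler(func(_ mqtt.Client, err error) {
            log.Printf("connection lost: %v", err)
        })

        client := mqtt.NewClient(opts)
        if tok := client.Connect(); tok.Wait() && tok.Error() != nil {
            log.Fatal(tok.Error())
        }
        select {} // block forever while messages arrive
    }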

If we still see the stalling with https://github.com/Jeffail/benthos/commit/bf21eea014c174c26a8c6ac65c608d11a605755e then logs from your MQTT broker would be the next most helpful thing. I can also put more time into trying to reproduce the stall locally; more information about the broker you're using and how it's configured would help with that.

louisburton commented 4 years ago

Unfortunately I was still able to reproduce this with the mqtt-stalls branch. Again, the Benthos stream reports as active, but the client has disconnected from the broker and is no longer processing. hungStackSingleStreamMqttStallsBranch.txt

The broker we're using is EMQ X (in this case, v3.2.3). In the debug logs from earlier in the evening I see messages such as this reflecting the stream client connecting:

    2020-07-21 18:24:24.849 [debug] 10.17.42.250:34094 [Protocol] RECV CONNECT(Q0, R0, D0, ClientId=expeditor_backbone_client-logs, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=30, Username=$my-user, Password=******)

But nothing later, nor around the time of the stalling.

We actually run it as a 3 node cluster of EMQ X Broker (Docker 3.2.3) deployed to AWS ECS (via DNS), and the Benthos service is a separate single node ECS service.

Config values:

broker.sys_interval = 1m
broker.sys_heartbeat = 30s
EMQX_ACL_NOMATCH | deny
EMQX_ALLOW_ANONYMOUS | false
EMQX_AUTH__HTTP__ACL_REQ | http://mqauth.platform.eu-west-1.mycloud:8080/acl
EMQX_AUTH__HTTP__ACL_REQ__METHOD | post
EMQX_AUTH__HTTP__AUTH_REQ | http://mqauth.platform.eu-west-1.mycloud:8080/connect
EMQX_AUTH__HTTP__AUTH_REQ__METHOD | post
EMQX_AUTH__HTTP__SUPER_REQ | http://mqauth.platform.eu-west-1.mycloud:8080/superuser
EMQX_AUTH__HTTP__SUPER_REQ__METHOD | post
EMQX_CLUSTER__DISCOVERY | dns
EMQX_CLUSTER__DNS__APP | mqttcluster
EMQX_CLUSTER__DNS__NAME | mqtt.platform.eu-west-1.mycloud
EMQX_CLUSTER__NAME | mqttcluster
EMQX_LISTENER__API__MGMT | 0.0.0.0:8080
EMQX_LISTENER__TCP__EXTERNAL | 0.0.0.0:1883
EMQX_LISTENER__WS__EXTERNAL | 8083
EMQX_LOADED_PLUGINS | emqx_recon,emqx_retainer,emqx_management,emqx_dashboard,emqx_auth_http
EMQX_LOG__LEVEL | debug
EMQX_NAME | mqttcluster

Thanks for all your help. Happy to try other branches if this helps.

Jeffail commented 4 years ago

Hey @louisburton, I've added two possible workarounds to the mqtt-stalls branch. Neither option is ideal, but hopefully one of them solves this until I have time to dig into the client lib some more.

The first is an async poll for whether the client thinks it's connected. If the bug here is that the client lib knows it's disconnected but is failing to trigger the right callbacks, then this will solve the issue. You can run the latest commit on the branch with no config changes and this behaviour will kick in.
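As a hypothetical sketch of what such an async poll looks like (the function name, interval, and reconnect callback are illustrative, not necessarily what the branch actually does):

    // Hypothetical sketch of an async connectivity poll; reconnect() stands in
    // for whatever tears down and re-dials the client.
    package mqttwatch

    import (
        "log"
        "time"

        mqtt "github.com/eclipse/paho.mqtt.golang"
    )

    func watchConnection(client mqtt.Client, interval time.Duration, reconnect func()) {
        go func() {
            for range time.Tick(interval) {
                // Catches the case where the client library knows the link is
                // down but never fired its lost-connection callback.
                if !client.IsConnected() {
                    log.Println("Connection lost for unknown reasons.")
                    reconnect()
                    return
                }
            }
        }()
    }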

If, however, the bug is that the client isn't even aware that it's disconnected, then the only way we can detect that is by the fact that no messages are coming through. To work around that I've added a config field stale_connection_timeout, an optional timeout duration whereby if no message has been received in that time the connection is assumed lost and will be re-established (a rough sketch of that mechanism follows the config example below).

The config looks like this:

input:
  mqtt:
    clean_session: false
    client_id: benthos_input
    qos: 1
    topics:
      - foo
    urls:
      - tcp://localhost:1883
    stale_connection_timeout: "30s"
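The mechanism described above boils down to a watchdog timer that is reset on every received message and, if it ever fires, assumes the connection is dead and triggers a reconnect. A rough sketch with hypothetical names, not the field's actual implementation:

    // Hypothetical sketch of the stale-connection watchdog described above.
    package mqttwatch

    import (
        "log"
        "time"
    )

    type staleWatchdog struct {
        timeout   time.Duration
        resetChan chan struct{}
        reconnect func()
    }

    func newStaleWatchdog(timeout time.Duration, reconnect func()) *staleWatchdog {
        w := &staleWatchdog{
            timeout:   timeout,
            resetChan: make(chan struct{}, 1),
            reconnect: reconnect,
        }
        go w.loop()
        return w
    }

    // MessageReceived should be called from the message handler for every
    // delivery; it pushes the deadline back by the configured timeout.
    func (w *staleWatchdog) MessageReceived() {
        select {
        case w.resetChan <- struct{}{}:
        default: // a reset is already pending
        }
    }

    func (w *staleWatchdog) loop() {
        timer := time.NewTimer(w.timeout)
        defer timer.Stop()
        for {
            select {
            case <-w.resetChan:
                if !timer.Stop() {
                    <-timer.C
                }
                timer.Reset(w.timeout)
            case <-timer.C:
                log.Println("Stale connection timeout triggered, re-establishing connection to broker.")
                w.reconnect()
                return
            }
        }
    }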

If you could try running the latest commit without that field first, it'd be useful to see whether you get error logs that state "Connection lost for unknown reasons." If the problem goes away then we can make do without stale_connection_timeout. If the problem persists, try running with stale_connection_timeout and look out for error logs of the form "Stale connection timeout triggered, re-establishing connection to broker."

louisburton commented 4 years ago

I've been trying to stall the system since about 9:40am BST this morning. Admittedly, on previous occasions it has taken hours of running this overnight to stall, but I think the majority of the time I would have stalled it by now. One thing I have noticed which I haven't seen before is some Connection lost messages. I've not seen the exact "unknown reasons" message, but are these logs perhaps an indication that we're recovering from disconnections and thus it's not stalling?

[screenshot] Also saw this on my MQTT Broker: [screenshot]

louisburton commented 4 years ago

I had a couple more "Connection lost due to: read tcp 10.17.36.238:47392->10.17.39.244:1883: read: connection reset by peer" errors from Benthos, with some reconnections seen on the broker. I feel confident this is now hardened against this issue given it's been 6 hours of constant hammering with no stalling, so I will cease testing.

I did not resort to using stale_connection_timeout: "30s", so I don't know if you want to keep it. To be honest it's nice to know it's available in case we run into trouble, but I understand if it adds unwanted complexity.

Thanks for all your efforts with this involved issue, which I realise boils down to resilience against issues between the MQTT client and broker. With these changes Benthos will be at the heart of our communication infrastructure, helping the company I work for in their fight against food waste, so it's much appreciated!

Jeffail commented 4 years ago

Awesome, thanks @louisburton. Obviously feel free to reopen if you see any more issues. I'll probably add the stale_connection_timeout field with a deprecated flag so that it gets shipped but doesn't appear in the docs. That way it can stick around until I'm more confident.