Open · filippog opened this issue 4 months ago
Hey @filippog 👋 Thank you for the detailed report! If you still have that instance running and you can poke at it, does it continue to print unable to dial: dial tcp...
errors or did it get completely stuck and it's no longer emitting any logs? Based on the goroutine dump it's definitely still running the loop which is meant to "poll fetches": https://github.com/redpanda-data/connect/blob/8277b6088df5829bead2e2a5e131ed3ad8e518fe/internal/impl/kafka/input_kafka_franz.go#L652C7-L652C23.
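For context, that loop essentially keeps calling franz-go's PollFetches. A minimal standalone sketch of that shape (not the actual Benthos/Connect implementation — the seed broker, group and topic below are placeholders) would be something like this; the point being that if PollFetches never returns, the goroutine still shows up as running in a dump even though nothing new gets logged:

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Standalone franz-go consumer loop, for illustration only.
	// Seed broker, group and topic names are placeholders.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("seed-broker.example:9093"),
		kgo.ConsumerGroup("example-group"),
		kgo.ConsumeTopics("example-topic"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for {
		// PollFetches blocks until records arrive, an error occurs, or the
		// context is cancelled. If it never returns, this goroutine keeps
		// showing up as "running" in a dump while the process logs nothing.
		fetches := client.PollFetches(context.Background())
		if fetches.IsClientClosed() {
			return
		}
		fetches.EachError(func(topic string, partition int32, err error) {
			log.Printf("fetch error on %s/%d: %v", topic, partition, err)
		})
		fetches.EachRecord(func(r *kgo.Record) {
			log.Printf("record at offset %d", r.Offset)
		})
	}
}
```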
If it keeps printing those errors, then I'm wondering if it may be some sort of DNS issue on the host where Benthos is running and, if that's the case, it would be handy to know if you can telnet to the seed broker addresses from that host. I don't know how franz-go handles DNS and whether it does any caching, but we can look into that if you can confirm the above.
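If you want to check the dial/DNS angle directly from the affected host, franz-go also accepts a custom dial function, so a tiny standalone program (again just a sketch, completely separate from the Benthos config — the seed address is a placeholder and TLS/SASL are omitted) can log every dial attempt the client makes and how long it takes:

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	stdDialer := &net.Dialer{Timeout: 10 * time.Second}

	client, err := kgo.NewClient(
		kgo.SeedBrokers("seed-broker.example:9093"), // placeholder address
		// Wrap the standard dialer so every dial attempt (which includes the
		// DNS lookup) is logged together with its duration and outcome.
		kgo.Dialer(func(ctx context.Context, network, host string) (net.Conn, error) {
			start := time.Now()
			conn, err := stdDialer.DialContext(ctx, network, host)
			log.Printf("dial %s %s took %s err=%v", network, host, time.Since(start), err)
			return conn, err
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Ping forces metadata requests, so the client actually dials the brokers.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := client.Ping(ctx); err != nil {
		log.Printf("ping failed: %v", err)
	}
}
```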
Hi @mihaitodor! Thank you for following up. Unfortunately the instance is no longer running, as it was restarted to restore service. It was indeed stuck and not emitting logs; these are the last lines (the value interpolation error is known, it's something emitted regularly that we're addressing separately, and those regular logs stopped too). You can see the logs stop at 15:16 and come back at 18:33 when SIGQUIT was issued:
Jul 16 15:12:29 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jul 16 15:12:29 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jul 16 15:16:29 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 15:16:39 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: SIGQUIT: quit
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: PC=0x4721c1 m=0 sigcode=0
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: goroutine 0 [idle]:
And the service was then restarted and came back up here:
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Running main config from specified file" @service=benthos path=/etc/benthos/mw_accesslog_metrics.yaml
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Listening for HTTP requests at: http://0.0.0.0:4152" @service=benthos
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Receiving messages from Kafka topics: [mediawiki.httpd.accesslog]" @service=benthos label=accesslog path=root.input
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Launching a benthos instance, use CTRL+C to close" @service=benthos
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Dropping messages." @service=benthos label=drop path=root.output
> If it keeps printing those errors, then I'm wondering if it may be some sort of DNS issue on the host where Benthos is running and, if that's the case, it would be handy to know if you can telnet to the seed broker addresses from that host. I don't know how franz-go handles DNS and whether it does any caching, but we can look into that if you can confirm the above.
From the host's perspective, and given the network maintenance, the outage manifested as temporary unavailability of some broker hosts (i.e. DNS kept working as usual).
For comparison, and in case it's useful, here's what another benthos instance on the same host logged. This instance kept working as usual; it talks to a different kafka cluster, one broker of which was equally affected by the maintenance:
Jul 16 15:16:29 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 15:16:39 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
I'm hoping this is a one-off (IIRC it's the first time we've seen this), but I'm reporting it in case other folks run into the same thing and/or you have seen this before. Thank you!
We are experiencing the same issue. The process halts with the following message:
the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists
Additionally, the readiness and liveness probes fail repeatedly with the error:
Readiness probe failed: Get "http://xxx/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
As a result, the application restarts. This behaviour has occurred more than twice today.
Is there any additional information or tests I can provide to assist in diagnosing the issue?
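If a goroutine dump of the wedged process would help, I can try to grab one next time before the liveness probe restarts things. For reference (mostly as an illustration of what such a dump involves), the generic Go mechanism for taking one without terminating the process — standard library only, nothing specific to connect, with SIGUSR1 chosen arbitrarily as the trigger — looks roughly like this:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutinesOnSignal writes a full goroutine dump to stderr whenever the
// process receives SIGUSR1, without terminating it (unlike SIGQUIT).
func dumpGoroutinesOnSignal() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 2); err != nil {
				log.Printf("goroutine dump failed: %v", err)
			}
		}
	}()
}

func main() {
	dumpGoroutinesOnSignal()

	// Placeholder for the real workload: block until the process is stopped.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop
}
```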
Hello, we're using benthos / redpanda connect 4.15 with great success, for the biggest part to read/write to/from kafka using the kafka_franz component. Yesterday there was a network maintenance which temporarily made a broker unavailable, which made kafka_franz lament a failure to commit. This has happened before and things self-recovered, but not this time: consuming stopped (and the observed consumer lag didn't increase, or we would have noticed sooner). Please find below the logs from benthos.
The full goroutine dump, which we got via SIGQUIT: https://phabricator.wikimedia.org/P66726
And the configuration which was running at the time: https://phabricator.wikimedia.org/P66727
Finally, the metrics showing the outage and recovery (all times UTC):
Interestingly enough, a similar but different benthos instance, acting on a different kafka cluster / broker that was also affected by the maintenance, was able to recover on its own, as has been our experience with similar maintenance in the past.