kafka4beam / brod

Apache Kafka client library for Erlang/Elixir
Apache License 2.0

payload connection down #431

Closed estin closed 3 years ago

estin commented 3 years ago

Hi!

The client I use to produce messages gets "closed" after ~10 minutes of idling:

12:27:05.685 [info] client kafka_client connected to kafka-broker:9092
12:37:05.868 [info] client kafka_client: payload connection down kafka-broker:9092
reason:{shutdown,tcp_closed}

12:39:25.796 [info] client kafka_client connected to kafka-broker:9092
12:49:25.993 [info] client kafka_client: payload connection down kafka-broker:9092
reason:{shutdown,tcp_closed}

When I then try to produce a message, the client fails indefinitely with this message:

12:37:39.178 [warning] Failed to (re)init connection, reason:
{shutdown,tcp_closed}

After restarting the whole app, everything is fine. But after another 10 minutes of idling, the client can't reconnect to Kafka.

brod version: 3.15.1

brod client configuration:

[
 {allow_topic_auto_creation,true},
 {auto_start_producers,true},
 {reconnect_cool_down_seconds,<<"10">>},
 {default_producer_config, [
    {retry_backoff_ms,500},
    {max_retries,-1},
    {required_acks,0},
    {partition_buffer_limit,1024},
    {partition_onwire_limit,1024},
    {max_linger_ms,100},
    {max_linger_count,1024}
 ]}
]

brod is used as part of a plugin for VerneMQ (I don't have much experience in this area: Erlang + brod + VerneMQ). Used k3s orchestration. Kafka is available at that time, and messages can be produced and consumed.

Any ideas why the client closes the connection after every 10 minutes of idling and can't reconnect? Maybe some extra action is needed in the supervisor, or a simple brod configuration change?

Sorry for my poor English.

k32 commented 3 years ago

It sounds like somewhere along the way idle TCP connections get closed. Can you add the following configuration to the configuration of your brod client:

{extra_sock_opts, [{keepalive, true}]}
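For context, here is a minimal sketch of where that option could sit when the client is started from code. The host, port and client name are taken from the logs above. The rest of the config is an assumption, not the reporter's actual setup, and the brod application is assumed to be running already.

```erlang
%% Sketch only: start a brod client with TCP keepalive enabled on its sockets.
%% Assumes the brod application has already been started.
ClientConfig =
    [ {auto_start_producers, true}
      %% extra socket options for the TCP connection
    , {extra_sock_opts, [{keepalive, true}]}
    ],
ok = brod:start_client([{"kafka-broker", 9092}], kafka_client, ClientConfig).
```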

Also it looks like you're using plaintext connection, do you see the same error when using ssl?

"Used k3s orchestration"

I am not even remotely an expert in Kubernetes, but which ingress controller do you use for the Kafka broker?

estin commented 3 years ago

Please check this: https://github.com/estin/verne_and_brod It is a full test case in a Docker environment (no k3s used). brod is used in a VerneMQ plugin.

The relevant Kafka setting: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#connections.max.idle.ms

After "payload connection down", i.e. once Kafka has closed the connection, brod can't reconnect. Thanks!

mattias01 commented 3 years ago

I've been having similar problems.

The Kafka broker closes idle connections after 10 minutes. If you don't have any consumers and don't produce anything, the connection becomes idle and is closed. If you have consumers, they periodically send heartbeats, so then it's not a problem (at least not from what I have seen). The OS keep-alive time is often set much higher than 10 minutes, so keepalive won't help unless you change it. On my machine, cat /proc/sys/net/ipv4/tcp_keepalive_time returns 7200 (seconds, i.e. 2 hours). It might not work anyway, because the Kafka broker might only look at requests, but I have not looked into that.

I also have the problem of running on Azure where they (load-balancers, event hubs) close connections after 4 minutes. So similar setups can have even lower allowed idle times for connections.

The Java client has the options metadata.max.age.ms and connections.max.idle.ms, which you can use to keep connections alive, or at least to close (and I assume re-open) them if they are idle for too long.

I can't find any similar functionality in brod, so I will handle it by wrapping brod and keeping track of when a connection (brod_client) becomes idle, by recording when I call produce. If it becomes idle, I will stop the client and then start it again. Hopefully this will fix my problems.
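A minimal sketch of what such a wrapper could look like. The module name, the 3-minute limit and the map-based state are all hypothetical. It assumes the brod application is running and that the client config has auto_start_producers set to true. It may also be unnecessary if the client reconnects on the next produce request, as discussed later in the thread.

```erlang
-module(idle_restart_producer).
-behaviour(gen_server).

%% Hypothetical wrapper around a brod client; not part of brod itself.
-export([start_link/3, produce/4, is_idle/3]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Assumed limit: restart the client after 3 minutes without a produce call.
-define(MAX_IDLE_MS, 180000).

start_link(Endpoints, ClientId, Config) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE,
                          {Endpoints, ClientId, Config}, []).

produce(Topic, Partition, Key, Value) ->
    gen_server:call(?MODULE, {produce, Topic, Partition, Key, Value}, infinity).

%% Pure helper: has the client been unused for longer than MaxIdleMs?
is_idle(LastUsedMs, NowMs, MaxIdleMs) ->
    NowMs - LastUsedMs > MaxIdleMs.

init({Endpoints, ClientId, Config}) ->
    ok = brod:start_client(Endpoints, ClientId, Config),
    {ok, #{endpoints => Endpoints, client => ClientId,
           config => Config, last_used => now_ms()}}.

handle_call({produce, Topic, Partition, Key, Value}, _From,
            #{endpoints := Endpoints, client := ClientId,
              config := Config, last_used := LastUsed} = State) ->
    case is_idle(LastUsed, now_ms(), ?MAX_IDLE_MS) of
        true ->
            %% Idle for too long: assume the socket is gone and start afresh.
            ok = brod:stop_client(ClientId),
            ok = brod:start_client(Endpoints, ClientId, Config);
        false ->
            ok
    end,
    Reply = brod:produce_sync(ClientId, Topic, Partition, Key, Value),
    {reply, Reply, State#{last_used := now_ms()}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

now_ms() ->
    erlang:monotonic_time(millisecond).
```

Routing every produce call through a single gen_server serializes them, which keeps the idle bookkeeping simple but would be a bottleneck under load.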

zmstone commented 3 years ago

{reconnect_cool_down_seconds,<<"10">>} is the real issue. It should be {reconnect_cool_down_seconds,10} instead.

The comparison at https://github.com/klarna/brod/blob/master/src/brod_client.erl#L715 always returns false when a number is compared to a binary, so the cool-down never appears to have elapsed. The tcp_closed error is simply the last error, re-emitted over and over again.
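The effect can be reproduced in isolation. In Erlang's term order every number sorts before every binary, so a ">=" check of an elapsed time against a binary threshold can never succeed. The module below is a hypothetical stand-in for such a check, not the actual brod_client code.

```erlang
-module(cool_down_demo).
-export([is_cooled_down/2]).

%% Hypothetical stand-in for a cool-down check: reconnecting is allowed
%% once at least Threshold seconds have elapsed.
is_cooled_down(ElapsedSeconds, Threshold) ->
    ElapsedSeconds >= Threshold.

%% In the Erlang shell:
%%   cool_down_demo:is_cooled_down(15, 10).            returns true
%%   cool_down_demo:is_cooled_down(15, <<"10">>).      returns false
%%   cool_down_demo:is_cooled_down(1000000, <<"10">>). returns false
```

No exception is raised for the mismatched types, which is why the misconfiguration only shows up as a client that never reconnects.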

estin commented 3 years ago

@zmstone hmm... in the repo where the bug was reproduced, this setting is OK: https://github.com/estin/verne_and_brod/blob/master/src/vernemq_demo_plugin_sup.erl#L25

zmstone commented 3 years ago

The brod producer is designed NOT to reconnect immediately when idle. The reconnect is triggered when the next produce request comes in. Are you sure there are produce requests in your reproduction procedure?

estin commented 3 years ago

@zmstone thanks! I understand it now! With {reconnect_cool_down_seconds,10} the socket comes alive again. It was misconfigured.