splunk / kafka-connect-splunk

Kafka connector for Splunk
Apache License 2.0

Allow for Out-of-Band Health-check in LoadBalancer to be Disabled #324

Closed: hubert-s closed this issue 2 years ago

hubert-s commented 2 years ago

In deployments where a pool of heavy forwarders or indexers is fronted by an external load balancer, the Kafka Connector configuration contains only a single address. In this scenario, the out-of-band health check does not take into account the available capacity of the pool behind the external load balancer. When a health check fails, all channels are removed for a period of time, including some that may otherwise be healthy. Although this period is configurable (120 seconds by default), frequently adding and removing channels based on an out-of-band check does not seem very elegant or efficient.

Furthermore, even after a successful out-of-band health check, the indexer object of the Kafka Connector may still receive a 503 response code from an indexer or heavy forwarder. This triggers the back-pressure handling, which I would consider an in-band health check. Unlike the out-of-band check, back-pressure applies to a specific channel, that is, a specific TCP session that is also typically maintained by a keep-alive. Avoiding a channel that has back-pressure for a preset period of time is a reasonable thing for the indexer object to do.
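
To make the in-band behaviour described above concrete, here is a minimal sketch of per-channel back-pressure handling. It is not the connector's actual code: the class `BackPressureTracker`, its methods, and the cooldown parameter are hypothetical names chosen for illustration. The only assumption taken from the discussion is that a 503 response marks one channel as unavailable for a preset period.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the connector's real classes.
// A 503 on one channel pauses only that channel for a preset cooldown.
final class BackPressureTracker {
    static final int HTTP_SERVICE_UNAVAILABLE = 503;

    private final long cooldownMillis;
    // Channel id -> timestamp (ms) until which the channel should be avoided.
    private final Map<String, Long> pausedUntil = new HashMap<>();

    BackPressureTracker(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    // Record the status code of a response received on a given channel.
    void onResponse(String channelId, int statusCode, long nowMillis) {
        if (statusCode == HTTP_SERVICE_UNAVAILABLE) {
            pausedUntil.put(channelId, nowMillis + cooldownMillis);
        }
    }

    // A channel is usable if it was never paused or its cooldown has expired.
    boolean isAvailable(String channelId, long nowMillis) {
        Long until = pausedUntil.get(channelId);
        return until == null || nowMillis >= until;
    }
}
```

The point of the sketch is that the decision is scoped to a single channel: other channels that did not see a 503 stay in rotation.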

In short, when an external load balancer is used, the out-of-band health check does not seem very useful. Therefore, I propose that setting splunk.hec.lb.poll.interval to, say, "-1" (or any negative integer) should disable the out-of-band health check.
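
The proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not the connector's implementation: `HealthPollerSketch` and its methods are hypothetical, and the value passed in stands for whatever was read from splunk.hec.lb.poll.interval. Rejecting zero is an extra assumption of this sketch, since the proposal only covers negative values.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a negative poll interval disables the out-of-band check.
final class HealthPollerSketch {
    private final int pollIntervalSeconds;
    private ScheduledExecutorService scheduler; // stays null when polling is disabled

    HealthPollerSketch(int pollIntervalSeconds) {
        // Assumption of this sketch: zero is rejected rather than given a meaning.
        if (pollIntervalSeconds == 0) {
            throw new IllegalArgumentException("poll interval must be non-zero");
        }
        this.pollIntervalSeconds = pollIntervalSeconds;
    }

    // Schedules the health check unless the interval is negative.
    // Returns true if polling was started, false if it is disabled.
    boolean start(Runnable healthCheck) {
        if (pollIntervalSeconds < 0) {
            return false; // out-of-band health check disabled
        }
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
                healthCheck, pollIntervalSeconds, pollIntervalSeconds, TimeUnit.SECONDS);
        return true;
    }

    // Stops polling if it was started; safe to call when disabled.
    void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
            scheduler = null;
        }
    }
}
```

With this shape, a value of -1 means no polling thread is ever created, while the existing default of 120 keeps the current behaviour.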

kashyap-splunk commented 2 years ago

Thanks @hubert-s for working on this. We will look into this and update here as soon as we can.