In deployments where a pool of heavy forwarders or indexers is fronted by an external load balancer, the Kafka Connector configuration contains only a single address. In this scenario, the out-of-band health check does not take into account the available capacity of the pool behind the load balancer. When a health check fails, all channels are removed for a period of time, including some that may be otherwise healthy. Although this period is configurable (120 seconds by default), frequently adding and removing channels based on an out-of-band check does not seem very elegant or efficient.
Furthermore, despite a successful out-of-band health check, the indexer object of the Kafka Connector may still receive a 503 result code from an indexer or heavy forwarder. This triggers the back-pressure handling, which I would consider an in-band health check. Unlike the out-of-band check, back-pressure is tied to a specific channel, that is, a specific TCP session that is typically maintained by a keep-alive. Avoiding a channel that has back-pressure for a preset period of time is a reasonable thing for the indexer object to do.
In short, when an external load balancer is used, the out-of-band health check does not seem very useful. Therefore, I propose that setting splunk.hec.lb.poll.interval to “-1” (or any negative integer) should disable the out-of-band health check.
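To make the proposal concrete, here is a minimal sketch of how the poll interval could gate the out-of-band check. The class and method names (HealthPollSketch, isPollingEnabled, start, stop) are hypothetical and are not taken from the connector's code base; only the meaning of a negative splunk.hec.lb.poll.interval comes from the proposal above. Treating zero as disabled is an extra assumption of this sketch, made because the JDK scheduler rejects a non-positive delay.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names are illustrative, not the connector's actual classes.
class HealthPollSketch {
    // Value of splunk.hec.lb.poll.interval, in seconds.
    private final int pollIntervalSeconds;
    // Stays null when the out-of-band check is disabled.
    private ScheduledExecutorService scheduler;

    HealthPollSketch(int pollIntervalSeconds) {
        this.pollIntervalSeconds = pollIntervalSeconds;
    }

    // Negative values disable polling, as proposed. Zero is also treated as
    // disabled here (assumption) because scheduleWithFixedDelay requires delay > 0.
    boolean isPollingEnabled() {
        return pollIntervalSeconds > 0;
    }

    // Schedules the health check only when polling is enabled.
    // Returns true if a poller was started, false if the check is disabled.
    boolean start(Runnable healthCheck) {
        if (!isPollingEnabled()) {
            return false;
        }
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
                healthCheck, pollIntervalSeconds, pollIntervalSeconds, TimeUnit.SECONDS);
        return true;
    }

    // Stops the poller if one was started; safe to call when disabled.
    void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
            scheduler = null;
        }
    }
}
```

With the out-of-band check disabled this way, the in-band back-pressure handling triggered by a 503 would remain the only mechanism that takes an individual channel out of rotation, which is the behaviour argued for above.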