linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.

KafkaAdminTopicConfigProvider.configure() should report original exception(s) raised by describeClusterConfigs #2087

Open marcelloromani opened 10 months ago

marcelloromani commented 10 months ago

I am using Cruise Control 2.5.99 with AWS MSK. I have reached the point where Cruise Control is able to connect to the Kafka brokers, but it then terminates with the following exception:

[2023-12-07 11:49:14,062] ERROR Uncaught exception on thread Thread[main,5,main] (com.linkedin.kafka.cruisecontrol.KafkaCruiseControlMain)
java.lang.RuntimeException: Failed to describe Kafka cluster configs.
    at com.linkedin.kafka.cruisecontrol.config.KafkaAdminTopicConfigProvider.configure(KafkaAdminTopicConfigProvider.java:174) ~[cruise-control-2.5.99.jar:?]
    at com.linkedin.kafka.cruisecontrol.config.KafkaCruiseControlConfigUtils.getConfiguredInstance(KafkaCruiseControlConfigUtils.java:49) ~[cruise-control-2.5.99.jar:?]
    at com.linkedin.kafka.cruisecontrol.config.KafkaCruiseControlConfig.getConfiguredInstance(KafkaCruiseControlConfig.java:98) ~[cruise-control-2.5.99.jar:?]
[...]

As the stacktrace indicates, that error message comes from: https://github.com/linkedin/cruise-control/blob/9ccbb9eeb497b23d9e98e76f7512abea908366af/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/config/KafkaAdminTopicConfigProvider.java#L172

  public void configure(Map<String, ?> configs) {
    _adminClient = (AdminClient) validateNotNull(
            configs.get(LoadMonitor.KAFKA_ADMIN_CLIENT_OBJECT_CONFIG),
            () -> String.format("Missing %s when creating Kafka Admin Client based Topic Config Provider",
                    LoadMonitor.KAFKA_ADMIN_CLIENT_OBJECT_CONFIG));
    Config clusterConfigs;
    try {
      clusterConfigs = describeClusterConfigs(_adminClient, DESCRIBE_CLUSTER_CONFIGS_TIMEOUT);
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException("Failed to describe Kafka cluster configs.");
    }
[...]

The catch block discards the caught InterruptedException or ExecutionException and replaces whatever useful error it carried with the generic message "Failed to describe Kafka cluster configs."

It would be useful if it instead logged the original exception, or included it in the generic RuntimeException (in the message or as the cause).
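
To illustrate the idea, here is a minimal, self-contained sketch of exception chaining. It is not a patch against Cruise Control: the class name `ChainedCauseSketch`, the stub `describeClusterConfigsStub()`, and the simulated failure are all hypothetical stand-ins, and the real `describeClusterConfigs` may fail in other ways. The only point is that passing the caught exception as the second constructor argument keeps the original error in the stack trace.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class ChainedCauseSketch {

  // Hypothetical stand-in for describeClusterConfigs(...). It fails the way a
  // future-based call does: the real error arrives wrapped in an ExecutionException.
  static String describeClusterConfigsStub() throws InterruptedException, ExecutionException {
    CompletableFuture<String> future = new CompletableFuture<>();
    future.completeExceptionally(new IllegalStateException("simulated authorization failure"));
    return future.get();
  }

  // Same generic message as today, but the caught exception is kept as the cause.
  static String configure() {
    try {
      return describeClusterConfigsStub();
    } catch (InterruptedException e) {
      // Restore the interrupt flag before rethrowing.
      Thread.currentThread().interrupt();
      throw new RuntimeException("Failed to describe Kafka cluster configs.", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("Failed to describe Kafka cluster configs.", e);
    }
  }

  // Walks the cause chain down to the innermost exception.
  static Throwable rootCause(Throwable t) {
    Throwable root = t;
    while (root.getCause() != null) {
      root = root.getCause();
    }
    return root;
  }

  public static void main(String[] args) {
    try {
      configure();
    } catch (RuntimeException e) {
      // The generic message is unchanged, but the underlying error is now reachable.
      System.out.println(e.getMessage() + " Root cause: " + rootCause(e));
    }
  }
}
```

With the cause attached, the uncaught-exception log would also print a "Caused by:" section, which is the information missing from the stack trace above.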

marcelloromani commented 10 months ago

In my setup Cruise Control is running as a pod on an EKS cluster which resides in the same AWS Account as the MSK cluster.

I double-checked the Pod Service Account -> IAM Role mapping and verified that the IAM Role has a Policy that allows all of the Kafka operations on the relevant MSK cluster.

Happy to provide more details about this if deemed useful.