linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

populate_disk_info is not working with default broker config #1953

Open bobelev opened 1 year ago

bobelev commented 1 year ago

Let's say all brokers are identical and the capacity JSON file contains only the default broker definition (brokerId -1):

{
  "brokerCapacities": [
    {
      "brokerId": "-1",
      "capacity": {
        "CPU": "100",
        "DISK": {
          "/var/lib/kafka/data1/data": "1500000",
          "/var/lib/kafka/data2/data": "1500000",
          "/var/lib/kafka/data3/data": "1500000",
          "/var/lib/kafka/data4/data": "1500000"
        },
        "NW_IN": "625000",
        "NW_OUT": "625000"
      }
    }
  ]
}

If you then request the load with populate_disk_info=true, you get an error:

GET /load?allow_capacity_estimation=true&json=true&populate_disk_info=true

Error processing GET request '/load' due to:
com.linkedin.kafka.cruisecontrol.exception.BrokerCapacityResolutionException: 
Unable to resolve capacity of broker 3. Either (1) adding the default broker capacity 
(via adding capacity for broker -1 and allowing capacity estimation) or 
(2) adding missing broker's capacity in file /etc/cruise-control/capacityJBOD.json.

If you define the capacity of each broker in the cluster explicitly, the endpoint works.
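The per-broker entries for this workaround can be generated from the default entry instead of being written by hand. The snippet below is a minimal sketch in Python and is not part of Cruise Control: the helper name `expand_default_capacity` and the broker ID list are assumptions, and the real broker IDs of the cluster have to be supplied by whoever runs it.

```python
import copy
import json


def expand_default_capacity(config, broker_ids):
    """Return a new capacity config with an explicit entry per broker ID.

    Hypothetical helper: brokers that already have an entry are left alone,
    every other broker gets a copy of the default ("-1") entry.
    """
    existing = {str(e["brokerId"]): e for e in config["brokerCapacities"]}
    default = existing.get("-1")
    if default is None:
        raise ValueError("capacity config has no default (-1) entry")

    entries = list(config["brokerCapacities"])  # keep the original entries
    for broker_id in broker_ids:
        key = str(broker_id)
        if key in existing:
            continue
        entry = copy.deepcopy(default)
        entry["brokerId"] = key
        entries.append(entry)
    return {"brokerCapacities": entries}


if __name__ == "__main__":
    default_only = {
        "brokerCapacities": [
            {
                "brokerId": "-1",
                "capacity": {
                    "CPU": "100",
                    "DISK": {"/var/lib/kafka/data1/data": "1500000"},
                    "NW_IN": "625000",
                    "NW_OUT": "625000",
                },
            }
        ]
    }
    # Broker IDs 1..3 are placeholders for the IDs in your own cluster.
    print(json.dumps(expand_default_capacity(default_only, [1, 2, 3]), indent=2))
```

The sketch keeps the default entry in the output. Whether to keep or drop it is a choice this thread does not settle.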

I think this might be related to the uneven disk capacity usage during cluster rebalancing (#1590). In my setup, some of the disks reach more than 95% used space, so proposal execution has to be stopped in order to rebalance the disks manually.

marcelloromani commented 8 months ago

I see the same behaviour with Cruise Control 2.5.134.

Error message:

Error processing GET request '/load' due to: 'com.linkedin.kafka.cruisecontrol.exception.BrokerCapacityResolutionException: Unable to resolve capacity of broker 2

capacity.json:

{
    "brokerCapacities": [
        {
            "brokerId": "-1",
            "capacity": {
                "CPU": "100",
                "DISK": {
                    "/data1": "1000000"
                },
                "NW_IN": "10000",
                "NW_OUT": "10000"
            }
        }
    ]
}

brokerSets.json:

{
    "brokerSets": [
        {
            "brokerSetId": "brokerSet0",
            "brokerIds": [0, 1]
        }
    ]
}

Perhaps my broker set is misconfigured? I see a reference to broker 2 in the error message.

Since I haven't specified a custom config file resolver, I am assuming the default specified in the comments applies, namely BrokerCapacityConfigFileResolver.

marcelloromani commented 8 months ago

I changed brokerSets to:

{
    "brokerSets": [
        {
            "brokerSetId": "brokerSet0",
            "brokerIds": [1, 2]
        }
    ]
}

and the error message changed to:

Error processing GET request '/load' due to: 'com.linkedin.kafka.cruisecontrol.exception.BrokerCapacityResolutionException: Unable to resolve capacity of broker 1

marcelloromani commented 8 months ago

At first glance I can't see the logic that would support reading the default values for a broker from broker id -1: https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_5/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/config/BrokerCapacityConfigFileResolver.java#L181

marcelloromani commented 8 months ago

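What fixed it for me was replacing the default (-1) entry with an explicit entry per broker (the DISK path also differs from my earlier file):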

capacity.json:

{
    "brokerCapacities": [
        {
            "brokerId": "1",
            "capacity": {
                "CPU": "100",
                "DISK": {
                    "/kafka/datalogs/logs": "1000000"
                },
                "NW_IN": "10000",
                "NW_OUT": "10000"
            }
        },
        {
            "brokerId": "2",
            "capacity": {
                "CPU": "100",
                "DISK": {
                    "/kafka/datalogs/logs": "1000000"
                },
                "NW_IN": "10000",
                "NW_OUT": "10000"
            }
        }
    ]
}
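
To see in advance which brokers would depend on the default entry, a capacity file can be checked against the list of broker IDs. This is a minimal sketch in Python, assuming the file layout shown in this thread. The helper name `brokers_without_explicit_capacity` is hypothetical, and the idea that falling back to the default entry triggers the exception rests only on the reports above, not on a confirmed reading of the resolver code.

```python
def brokers_without_explicit_capacity(config, broker_ids):
    """List the broker IDs that have no explicit entry in the capacity config.

    Hypothetical helper: the default ("-1") entry does not count as explicit.
    """
    explicit = {str(e["brokerId"]) for e in config["brokerCapacities"]}
    explicit.discard("-1")
    return sorted(b for b in broker_ids if str(b) not in explicit)


if __name__ == "__main__":
    capacity = {
        "brokerCapacities": [
            {"brokerId": "1", "capacity": {"CPU": "100"}},
            {"brokerId": "2", "capacity": {"CPU": "100"}},
        ]
    }
    # Broker 3 is a placeholder for a broker missing from the file.
    print(brokers_without_explicit_capacity(capacity, [1, 2, 3]))
```

An empty result means every listed broker has its own entry, which matches the configuration that worked here.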