linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars, 587 forks

CC caused "disk is full" for one of logDirs for broker during rebalancing. #1590

Open emelyanovtv opened 3 years ago

emelyanovtv commented 3 years ago

Description:

We got a "disk is full" error during rebalancing. Each broker has 2 log dirs of the same size, but while the rebalance was running we noticed that only one disk (log dir) on the broker was being filled with new data. The disk capacity for the log dir /var/dirs/kafka/data/topics was set to 3584000 MB (see the configuration below), but after the rebalance finished with the error and we increased the disk size for this specific log dir (the one that ran out of space), it became 3644675 MB. Why can this kind of thing happen? Can you help us get a clearer explanation for this error?

Our main assumption is that this happened because we moved almost 11 TB of data among brokers, which took 2 days. Perhaps that is the root cause of this error.

Steps taken:

Question:

Current setup

cruisecontrol.properties:

capacity.config.file=config/capacity.json

capacity.json (all brokers have the same settings as below):

{
  "brokerId": "N",
  "capacity": {
    "DISK": {
      "/var/dirs/kafka/data/topics": "3584000",
      "/var/dirs/kafka/data1/topics": "3584000"
    },
    "CPU": "100",
    "NW_IN": "10000",
    "NW_OUT": "10000"
  },
  "doc": "Capacity unit used for disk is in MB, cpu is in percentage, network throughput is in KB."
}
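As a quick sanity check on the entry above, here is a minimal Python sketch that parses one broker entry and sums the per-log-dir disk capacities. The broker id "0" and the helper name `disk_capacity_mb` are hypothetical, chosen only for illustration; the paths and values are copied from the config shown.

```python
import json

# One broker entry shaped like the capacity.json excerpt above.
# The brokerId "0" is a hypothetical stand-in for the "N" placeholder.
ENTRY = """
{
  "brokerId": "0",
  "capacity": {
    "DISK": {
      "/var/dirs/kafka/data/topics": "3584000",
      "/var/dirs/kafka/data1/topics": "3584000"
    },
    "CPU": "100",
    "NW_IN": "10000",
    "NW_OUT": "10000"
  }
}
"""


def disk_capacity_mb(entry):
    """Return ({logdir: capacity_mb}, total_mb) for one broker entry."""
    disk = entry["capacity"]["DISK"]
    per_dir = {path: float(mb) for path, mb in disk.items()}
    return per_dir, sum(per_dir.values())


per_dir, total = disk_capacity_mb(json.loads(ENTRY))
for path, mb in per_dir.items():
    print(f"{path}: {mb:.0f} MB")
print(f"broker total: {total:.0f} MB")
```

With two log dirs of 3584000 MB each, the broker total comes to 7168000 MB.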

If you need more info from me, I'll gladly share it.

emelyanovtv commented 3 years ago

Can anybody help with some assumptions or anything else? I'm running out of ideas.

efeg commented 3 years ago

@emelyanovtv I am curious what the load endpoint of Cruise Control shows with populate_disk_info=true. I wonder if the same disk is used by other services, causing the capacity to be used by not only Kafka, but also some other service.
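For reference, a minimal Python sketch of how the request suggested above might be built. The host name is hypothetical, and the port 9090 and the /kafkacruisecontrol path prefix are assumed defaults that may differ in a given deployment; only populate_disk_info=true comes from the comment above, and json=true is added on the assumption that JSON output is wanted.

```python
from urllib.parse import urlencode, urlunsplit


def load_url(host, port=9090, populate_disk_info=True, as_json=True):
    """Build a URL for the Cruise Control load endpoint.

    Port and path prefix are assumed defaults, not taken from this thread.
    """
    query = urlencode({
        "populate_disk_info": str(populate_disk_info).lower(),
        "json": str(as_json).lower(),
    })
    return urlunsplit(
        ("http", f"{host}:{port}", "/kafkacruisecontrol/load", query, "")
    )


# "cruise-control.example" is a hypothetical host name.
print(load_url("cruise-control.example"))
```

The resulting URL can be fetched with any HTTP client, for example curl or urllib.request.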

emelyanovtv commented 3 years ago

Those disks are dedicated to the Kafka brokers only. BTW, after we got this error I ran the rebalance manually. Below is part of the load output for the brokers; I checked it and everything looks valid to me.

HOST            BROKER  RACK    LOGDIR                        DISK_CAP(MB)  DISK(MB)/_(%)_     CORE_NUM  CPU(%)  NW_IN_CAP(KB/s)  LEADER_NW_IN(KB/s)  FOLLOWER_NW_IN(KB/s)  NW_OUT_CAP(KB/s)  NW_OUT(KB/s)  PNW_OUT(KB/s)  LEADERS/REPLICAS
kafka-0.broker  200     rack-b                                7168000.000   5019532.000/70.03  1         71.517  10000.000        223.701             463.693               10000.000         1221.138      3717.967       366/1128
                                /var/dirs/kafka/data/topics                 3005975.379/83.87                                                                                                                              320/992
                                /var/dirs/kafka/data1/topics                2013565.704/56.18                                                                                                                              46/136
kafka-1.broker  201     rack-c                                7168000.000   4524581.500/63.12  1         77.026  10000.000        225.302             519.606               10000.000         1261.675      4125.801       358/1121
                                /var/dirs/kafka/data/topics                 2928987.955/81.72                                                                                                                              326/1040
                                /var/dirs/kafka/data1/topics                1595596.534/44.52                                                                                                                              32/81
kafka-2.broker  202     rack-a                                7168000.000   4736796.500/66.08  1         66.809  10000.000        233.720             509.744               10000.000         1282.256      4090.291       419/1097
                                /var/dirs/kafka/data/topics                 2971493.640/82.91                                                                                                                              388/986
                                /var/dirs/kafka/data1/topics                1765304.770/49.26                                                                                                                              31/111

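
The per-log-dir rows above can be cross-checked against the configured capacity with a little arithmetic. The following Python sketch derives the capacity implied by each DISK(MB)/_(%)_ pair and the headroom left under the configured 3584000 MB. The values are copied from the output above; the helper name `implied_capacity_mb` is hypothetical.

```python
# (broker, logdir, used_mb, used_pct) copied from the load output above.
ROWS = [
    (200, "/var/dirs/kafka/data/topics", 3005975.379, 83.87),
    (200, "/var/dirs/kafka/data1/topics", 2013565.704, 56.18),
    (201, "/var/dirs/kafka/data/topics", 2928987.955, 81.72),
    (201, "/var/dirs/kafka/data1/topics", 1595596.534, 44.52),
    (202, "/var/dirs/kafka/data/topics", 2971493.640, 82.91),
    (202, "/var/dirs/kafka/data1/topics", 1765304.770, 49.26),
]

# Per-log-dir capacity from capacity.json, in MB.
CONFIGURED_MB = 3584000


def implied_capacity_mb(used_mb, used_pct):
    """Capacity that the reported usage percentage implies."""
    return used_mb / (used_pct / 100.0)


implied = []
for broker, logdir, used_mb, used_pct in ROWS:
    cap = implied_capacity_mb(used_mb, used_pct)
    implied.append(cap)
    headroom = CONFIGURED_MB - used_mb
    print(f"broker {broker} {logdir}: "
          f"implied capacity ~{cap:.0f} MB, headroom {headroom:.0f} MB")
```

Within the rounding of the two-decimal percentages, every row implies a capacity close to the configured 3584000 MB, while the first log dir on each broker is noticeably fuller than the second.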
emelyanovtv commented 3 years ago

@efeg any ideas?