CC caused "disk is full" on one of a broker's log dirs during rebalancing.

emelyanovtv commented 3 years ago

Description:

We got a "disk is full" error during rebalancing. We have 2 log dirs per broker, both the same size. While the rebalance was running, we noticed that only one disk (log dir) on the broker was being filled with new data. The disk capacity (described below) for the log dir /var/dirs/kafka/data/topics was set to 3584000 MB, but after the rebalance failed with the out-of-disk-space error and we increased the disk size for this specific log dir, its size became 3644675 MB. Why can this kind of thing happen? Can you help us understand this error more clearly?

Our main assumption is that this happened because we moved almost 11 TB of data among brokers and it took 2 days. Perhaps that is the root cause of this error.

The steps were:

1. Ran the rebalance.
2. Hit the limit on one broker: one of its log dirs became full. The size of that log dir on the failed broker (let's say broker-1, /var/dirs/kafka/data/topics) was 3584000 MB.
3. Increased the size of this specific log dir.
4. Restarted the broker.
5. Everything came back up successfully, and the size of the same log dir (broker-1, /var/dirs/kafka/data/topics) became 3644675 MB.

Question:

Has this happened to anyone before, and how can I avoid it?

Current setup

CC: 2.0.168
kafka: 2.3.0
the execution task was triggered with POST /kafkacruisecontrol/rebalance?json=true&dryrun=false&concurrent_partition_movements_per_broker=4&concurrent_leader_movements=10
total duration before it failed was almost 50 hours (179776 sec)
each broker has 2 log dirs.
settings

cruisecontrol.properties:

capacity.config.file=config/capacity.json

capacity.json (for all brokers we have the same settings as below)

{
  "brokerId": "N",
  "capacity": {
    "DISK": {
      "/var/dirs/kafka/data/topics": "3584000",
      "/var/dirs/kafka/data1/topics": "3584000"
    },
    "CPU": "100",
    "NW_IN": "10000",
    "NW_OUT": "10000"
  },
  "doc": "Capacity unit used for disk is in MB, cpu is in percentage, network throughput is in KB."
}
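As a quick sanity check on the entry above, here is a minimal Python sketch that sums the per-log-dir DISK capacities for one broker. The entry is copied from the snippet above (without the doc field); the helper name `total_disk_capacity_mb` is hypothetical and not part of Cruise Control.

```python
import json

# One broker entry, copied from the capacity.json snippet above.
BROKER_ENTRY = """
{
  "brokerId": "N",
  "capacity": {
    "DISK": {
      "/var/dirs/kafka/data/topics": "3584000",
      "/var/dirs/kafka/data1/topics": "3584000"
    },
    "CPU": "100",
    "NW_IN": "10000",
    "NW_OUT": "10000"
  }
}
"""


def total_disk_capacity_mb(entry: dict) -> float:
    """Sum the per-log-dir DISK capacities (in MB) of one broker entry."""
    disk = entry["capacity"]["DISK"]
    # Capacities are stored as strings in the config, so convert before summing.
    return sum(float(mb) for mb in disk.values())


entry = json.loads(BROKER_ENTRY)
total_mb = total_disk_capacity_mb(entry)
print(f"total disk capacity: {total_mb:.0f} MB")
```

With two log dirs of 3584000 MB each, the per-broker total comes to 7168000 MB.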

If you need more info from me, I'll gladly share it.

emelyanovtv commented 3 years ago

Can anybody help with some assumptions or suggestions? I'm running out of ideas.

efeg commented 3 years ago

@emelyanovtv I am curious what the load endpoint of Cruise Control shows with populate_disk_info=true. I wonder if the same disk is used by other services, causing the capacity to be used by not only Kafka, but also some other service.
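A minimal sketch of how such a request URL could be built in Python. The `/kafkacruisecontrol` prefix and `json=true` are taken from the rebalance request earlier in this thread; the base URL `http://localhost:9090` and the helper name `build_load_url` are assumptions for illustration only.

```python
from urllib.parse import urlencode


def build_load_url(base_url: str) -> str:
    """Build the load endpoint URL with per-log-dir disk info enabled."""
    params = {"json": "true", "populate_disk_info": "true"}
    return f"{base_url}/kafkacruisecontrol/load?{urlencode(params)}"


# Assumed host and port; replace with the address of your Cruise Control instance.
url = build_load_url("http://localhost:9090")
print(url)
```

The resulting URL can then be requested with any HTTP client (for example `curl`, or `urllib.request.urlopen` in Python).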

emelyanovtv commented 3 years ago

Those disks are dedicated to the Kafka brokers only. BTW, after we got this error I did the rebalance manually. I'll post part of the data for the brokers; I checked it and everything seems valid to me.

                       HOST         BROKER          RACK                               LOGDIR        DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
 kafka-0.broker,           200,rack-b,                                             7168000.000,        5019532.000/70.03,                  1,        71.517,               10000.000,                 223.701,                 463.693,               10000.000,           1221.138,           3717.967,           366/1128
                                                                                  /var/dirs/kafka/data/topics,        3005975.379/83.87,                                                                                                                    320/992
                                                                                 /var/dirs/kafka/data1/topics,        2013565.704/56.18,                                                                                                                     46/136
 kafka-1.broker,           201,rack-c,                                             7168000.000,        4524581.500/63.12,                  1,        77.026,               10000.000,                 225.302,                 519.606,               10000.000,           1261.675,           4125.801,           358/1121
                                                                                  /var/dirs/kafka/data/topics,        2928987.955/81.72,                                                                                                                    326/1040
                                                                                 /var/dirs/kafka/data1/topics,        1595596.534/44.52,                                                                                                                     32/81
 kafka-2.broker,           202,rack-a,                                             7168000.000,        4736796.500/66.08,                  1,        66.809,               10000.000,                 233.720,                 509.744,               10000.000,           1282.256,           4090.291,           419/1097
                                                                                  /var/dirs/kafka/data/topics,        2971493.640/82.91,                                                                                                                    388/986
                                                                                 /var/dirs/kafka/data1/topics,        1765304.770/49.26,                                                                                                                     31/111
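To make the per-log-dir skew in this output easier to see, here is a small Python sketch that recomputes the utilization percentages and the gap between the two log dirs on each broker. The used-MB figures are copied from the output above and the 3584000 MB capacity from capacity.json; the helper names are hypothetical, and reading the reported percentage as used MB over configured log dir capacity is an assumption that happens to match the numbers.

```python
LOGDIR_CAPACITY_MB = 3584000  # per log dir, from capacity.json

# (broker id, log dir) -> used MB, copied from the load output above.
USED_MB = {
    (200, "/var/dirs/kafka/data/topics"): 3005975.379,
    (200, "/var/dirs/kafka/data1/topics"): 2013565.704,
    (201, "/var/dirs/kafka/data/topics"): 2928987.955,
    (201, "/var/dirs/kafka/data1/topics"): 1595596.534,
    (202, "/var/dirs/kafka/data/topics"): 2971493.640,
    (202, "/var/dirs/kafka/data1/topics"): 1765304.770,
}


def utilization_pct(used_mb: float, capacity_mb: float) -> float:
    """Disk utilization as a percentage, rounded to two decimals."""
    return round(100.0 * used_mb / capacity_mb, 2)


def logdir_gap_mb(broker: int) -> float:
    """Difference in used MB between the fullest and emptiest log dir of a broker."""
    sizes = [used for (b, _), used in USED_MB.items() if b == broker]
    return max(sizes) - min(sizes)


for (broker, logdir), used in USED_MB.items():
    print(f"broker {broker} {logdir}: {utilization_pct(used, LOGDIR_CAPACITY_MB)}%")

for broker in (200, 201, 202):
    print(f"broker {broker} gap between log dirs: {logdir_gap_mb(broker):.0f} MB")
```

On every broker the first log dir sits above 80% while the second stays below 60%, and the first log dir also holds most of the replicas (for example 992 of 1128 on broker 200).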

emelyanovtv commented 3 years ago

@efeg any ideas?

linkedin / cruise-control