Open emelyanovtv opened 3 years ago
Can anybody help with some hypotheses or suggestions? I'm running out of ideas.
@emelyanovtv I am curious what the load endpoint of Cruise Control shows with populate_disk_info=true.
I wonder if the same disk is used by other services, causing the capacity to be used by not only Kafka, but also some other service.
Those disks are dedicated to the Kafka brokers. BTW, after we got this error I rebalanced manually. I'll post part of the data for the brokers; I checked it and everything looks valid to me.
HOST BROKER RACK LOGDIR DISK_CAP(MB) DISK(MB)/_(%)_ CORE_NUM CPU(%) NW_IN_CAP(KB/s) LEADER_NW_IN(KB/s) FOLLOWER_NW_IN(KB/s) NW_OUT_CAP(KB/s) NW_OUT(KB/s) PNW_OUT(KB/s) LEADERS/REPLICAS
kafka-0.broker, 200,rack-b, 7168000.000, 5019532.000/70.03, 1, 71.517, 10000.000, 223.701, 463.693, 10000.000, 1221.138, 3717.967, 366/1128
/var/dirs/kafka/data/topics, 3005975.379/83.87, 320/992
/var/dirs/kafka/data1/topics, 2013565.704/56.18, 46/136
kafka-1.broker, 201,rack-c, 7168000.000, 4524581.500/63.12, 1, 77.026, 10000.000, 225.302, 519.606, 10000.000, 1261.675, 4125.801, 358/1121
/var/dirs/kafka/data/topics, 2928987.955/81.72, 326/1040
/var/dirs/kafka/data1/topics, 1595596.534/44.52, 32/81
kafka-2.broker, 202,rack-a, 7168000.000, 4736796.500/66.08, 1, 66.809, 10000.000, 233.720, 509.744, 10000.000, 1282.256, 4090.291, 419/1097
/var/dirs/kafka/data/topics, 2971493.640/82.91, 388/986
/var/dirs/kafka/data1/topics, 1765304.770/49.26, 31/111
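As a rough cross-check of the figures above, the per-log-dir percentages can be reproduced by hand. The sketch below assumes each log dir has a capacity of 3584000 MB (half of the 7168000 MB DISK_CAP, and the figure quoted in the description); the dictionary and function names are only illustrative.

```python
# Per-log-dir usage in MB, copied from the load output above (keyed by broker id).
LOGDIR_USAGE_MB = {
    200: {"/var/dirs/kafka/data/topics": 3005975.379,
          "/var/dirs/kafka/data1/topics": 2013565.704},
    201: {"/var/dirs/kafka/data/topics": 2928987.955,
          "/var/dirs/kafka/data1/topics": 1595596.534},
    202: {"/var/dirs/kafka/data/topics": 2971493.640,
          "/var/dirs/kafka/data1/topics": 1765304.770},
}

# Assumption: every log dir has the same capacity, half of DISK_CAP(MB).
LOGDIR_CAPACITY_MB = 3584000.0


def utilization_pct(used_mb, capacity_mb=LOGDIR_CAPACITY_MB):
    """Return disk utilization as a percentage, rounded to two decimals."""
    return round(used_mb / capacity_mb * 100, 2)


def logdir_skew_pct(usage_by_dir):
    """Return the gap, in percentage points, between the fullest and emptiest log dir."""
    pcts = [utilization_pct(used) for used in usage_by_dir.values()]
    return round(max(pcts) - min(pcts), 2)


for broker_id, dirs in LOGDIR_USAGE_MB.items():
    for path, used in dirs.items():
        print(broker_id, path, utilization_pct(used))
    print(broker_id, "skew (percentage points):", logdir_skew_pct(dirs))
```

Under this assumption the computed values match the percentages in the output, and on every broker the first log dir is roughly 28 to 37 percentage points fuller than the second.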
@efeg any ideas?
Description:
We got the error "disk is full" during rebalancing. We have two log dirs per broker, both the same size, but while the rebalance was running we noticed that only one disk (log dir) on the broker was being filled with new data. The disk capacity (described below) for the log dir /var/dirs/kafka/data/topics was set to 3584000 MB, but after the rebalance finished with the error and we increased the disk size for this specific log dir (the one that ran out of disk space), it became 3644675 MB. Why can this happen? Can you help us understand this error more clearly? Our main assumption is that we moved almost 11 TB of data among brokers and it took 2 days; perhaps that is the root cause of this error.
The steps were:
3584000 MB
3644675 MB
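To put the two figures side by side: if 3644675 is read as the amount of data on the log dir after the failure (an interpretation, not something the report states outright), it exceeds the configured capacity of 3584000. A small sketch of the arithmetic, with illustrative variable names:

```python
# Both values are taken from the report; the interpretation of the second
# one as "data on the log dir after the failed rebalance" is an assumption.
CONFIGURED_CAPACITY_MB = 3584000
OBSERVED_SIZE_MB = 3644675

# Absolute and relative amount by which the observed size exceeds the capacity.
overshoot_mb = OBSERVED_SIZE_MB - CONFIGURED_CAPACITY_MB
overshoot_pct = round(overshoot_mb / CONFIGURED_CAPACITY_MB * 100, 2)

print(f"overshoot: {overshoot_mb} MB ({overshoot_pct}% above the configured capacity)")
```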
Question:
Current setup
POST /kafkacruisecontrol/rebalance?json=true&dryrun=false&concurrent_partition_movements_per_broker=4&concurrent_leader_movements=10
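For reference, the request above can be assembled programmatically. This is a minimal sketch using only the Python standard library; the base URL http://localhost:9090 and the helper name cruise_control_request are assumptions, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Cruise Control is reachable at this address; adjust as needed.
BASE_URL = "http://localhost:9090"


def cruise_control_request(endpoint, params, method="GET"):
    """Build (but do not send) a request to a Cruise Control REST endpoint."""
    query = urlencode(params)
    return Request(f"{BASE_URL}/kafkacruisecontrol/{endpoint}?{query}", method=method)


# Same parameters as the rebalance call quoted above.
rebalance = cruise_control_request(
    "rebalance",
    {
        "json": "true",
        "dryrun": "false",
        "concurrent_partition_movements_per_broker": 4,
        "concurrent_leader_movements": 10,
    },
    method="POST",
)

print(rebalance.get_method(), rebalance.full_url)
```

The same helper called with the endpoint "load" and {"populate_disk_info": "true"} would build the load query asked about earlier in the thread. Passing the resulting object to urllib.request.urlopen would actually send it.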
cruisecontrol.properties:
capacity.json (for all brokers we have the same settings as below)
If you need more info from me, I can easily share it with you.