linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Kubernetes integration #74

Open otisg opened 6 years ago

otisg commented 6 years ago

https://www.youtube.com/watch?v=lf31udm9cYY mentions Kubernetes integration, and Kubernetes has the notion of an Operator. It would be great if one could use CC with k8s.

becketqin commented 6 years ago

@otisg Yes, that is something valuable and it is on our roadmap. It may require some more thought and discussion, though.

otisg commented 6 years ago

Right. Do you have a rough idea of the timeline?

StevenACoffman commented 6 years ago

@otisg You may wish to check out this kafka-operator which integrates cruise control.

Alternatively, this issue Yolean/kubernetes-kafka#100 seeks to integrate cruise control into the major Kubernetes Kafka implementation solution.

Unfortunately, there's currently a disconnect between the two approaches.

becketqin commented 6 years ago

@StevenACoffman Wow! This is awesome. Great to see the integration with K8S!

baluchicken commented 5 years ago

We've just released an open-source Kafka operator which integrates Cruise Control, and would be very happy to receive some feedback. You can check out the repo here.

kyguy commented 5 years ago

Since there are a few different options for running Kafka on Kubernetes, I think we should come up with a standard way of running Cruise Control on Kubernetes. For starters, we could provide a standard:

From what I understand, Cruise Control would also require a couple of patches for resource estimation and partition reassignment proposals while running on Kubernetes:

  1. Currently, Cruise Control relies on the hostname of a Kafka broker e.g. node.host() to uniquely identify the node which that broker is running on [1]. However, assuming that every broker is run in its own Kubernetes pod, node.host() will return the hostname of the pod, not the hostname of the node that broker pod is running on. This leads Cruise Control to underestimate the resources available on nodes which are running more than one broker pod.
  2. Since Cruise Control is not pod aware, it’s also possible that Cruise Control could propose partition assignments that place replicas of the same partition on broker pods running on the same node. So if a node goes down, it’s possible we would lose more than one replica at once.

Using the Kubernetes API, we could get around these issues by mapping broker pods to their actual nodes as Cruise Control builds its cluster model.
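To illustrate the idea, here is a minimal sketch in Python (not Cruise Control code). The broker ids, pod hostnames, node names, and both helper functions are made up for illustration, and the `pod_to_node` dict is only a stand-in for a Kubernetes API lookup of each pod's `spec.nodeName`.

```python
from collections import defaultdict

# Hypothetical data: broker id -> hostname as node.host() would report it (the pod).
broker_pod_hosts = {0: "kafka-0", 1: "kafka-1", 2: "kafka-2"}

# Hypothetical stand-in for a Kubernetes API lookup (pod name -> spec.nodeName).
pod_to_node = {"kafka-0": "node-a", "kafka-1": "node-a", "kafka-2": "node-b"}


def brokers_by_host(broker_hosts, pod_nodes=None):
    """Group broker ids by host identifier.

    Without pod_nodes, every pod hostname counts as its own host.
    With pod_nodes, pods are grouped under the node they run on.
    """
    groups = defaultdict(list)
    for broker_id, host in sorted(broker_hosts.items()):
        key = pod_nodes.get(host, host) if pod_nodes else host
        groups[key].append(broker_id)
    return dict(groups)


def colocated_replicas(replicas, broker_hosts, pod_nodes):
    """Return the nodes that hold more than one replica of a single partition."""
    by_node = defaultdict(list)
    for broker_id in replicas:
        host = broker_hosts[broker_id]
        by_node[pod_nodes.get(host, host)].append(broker_id)
    return {node: ids for node, ids in by_node.items() if len(ids) > 1}


# Pod view: three separate "hosts". Node view: brokers 0 and 1 share node-a.
print(brokers_by_host(broker_pod_hosts))
print(brokers_by_host(broker_pod_hosts, pod_to_node))
# A partition with replicas on brokers 0 and 1 would lose both if node-a fails.
print(colocated_replicas([0, 1], broker_pod_hosts, pod_to_node))
```

Grouping by node rather than by pod hostname is what would let the cluster model account for shared node resources and avoid placing two replicas of a partition on the same node.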

What do you all think? Any other ideas?

[1] https://github.com/linkedin/cruise-control/blob/2.0.70/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/LoadMonitor.java#L508 ( Edited for static link )

kyguy commented 4 years ago

Would people be open to a patch that would:

  1. Offer a boolean configuration option in the cruisecontrol.properties file that would specify whether or not Cruise Control would be running in a Kubernetes environment.
  2. If the boolean option is set to true, have a Kubernetes client use the hostname provided by node.host() [1] to look up, via the Kubernetes API, the actual nodeName of the node where that broker host/pod is running. Then use that nodeName as the unique identifier for the "hosts" (nodes) where the brokers reside when building the cluster model [2].

This would fix the two issues mentioned in the previous comment.
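A rough sketch of how the two steps could fit together, written in Python for brevity rather than as actual Cruise Control code. The property key `kubernetes.mode`, the `HostResolver` class, and the injected lookup function are all hypothetical; in a real patch the lookup would be backed by a Kubernetes client reading the pod's `spec.nodeName`.

```python
# Hypothetical property key; the real name would be decided in the patch.
K8S_FLAG = "kubernetes.mode"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


class HostResolver:
    """Choose the identifier used for a broker's host in the cluster model."""

    def __init__(self, props, lookup_node_name):
        # The option defaults to false, so existing deployments are unaffected.
        self.enabled = props.get(K8S_FLAG, "false").lower() == "true"
        # lookup_node_name: callable mapping a pod hostname to a node name or None.
        self.lookup_node_name = lookup_node_name

    def host_id(self, broker_host):
        if not self.enabled:
            return broker_host
        node_name = self.lookup_node_name(broker_host)
        # Fall back to the reported hostname if the pod cannot be resolved.
        return node_name if node_name else broker_host


# Usage with a dict standing in for the Kubernetes API.
props = parse_properties("# cruisecontrol.properties excerpt\nkubernetes.mode = true\n")
resolver = HostResolver(props, {"kafka-0": "node-a"}.get)
print(resolver.host_id("kafka-0"))
```

Injecting the lookup as a plain callable keeps the Kubernetes dependency optional and makes the flag-off path identical to the current behavior.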

[1] https://github.com/linkedin/cruise-control/blob/2.0.70/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/LoadMonitor.java#L508

[2] https://github.com/linkedin/cruise-control/blob/2.0.70/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/model/Rack.java#L259