redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

rpk: provide alternative to ClusterRoleBinding for `rpk debug bundle` in k8s environments #18058

Open r-vasquez opened 2 months ago

r-vasquez commented 2 months ago

Who is this for and what problem do they have today?

Currently, rpk relies on having a ClusterRole to collect the information needed for the debug bundle, see:

https://docs.redpanda.com/current/manage/kubernetes/troubleshooting/k-diagnostics-bundle/#generate-a-diagnostics-bundle

This is done to:

  1. Discover the admin API addresses of the cluster. Currently, there is no other way to do that (see https://github.com/redpanda-data/redpanda/issues/8975).
  2. Collect the logs of every pod in the cluster. This saves time in large clusters, since the user only has to create one bundle instead of n bundles.
  3. Collect k8s resources in the Redpanda namespace for debugging.

Alternatives discussed:

This issue is to track the discussion, but the alternatives discussed are:

JIRA Link: CORE-2649

r-vasquez commented 2 weeks ago

One of the alternatives discussed with @chrisseto is to:

  1. Authenticate from an out-of-cluster client (rpk) to the k8s API using the kubeconfig file. Example: https://github.com/kubernetes/client-go/tree/master/examples/out-of-cluster-client-configuration
  2. Grab all the information needed from the k8s API, in this case:
     a. Logs.
     b. All pods in the cluster and their Admin API addresses.
     c. Resources for debugging (current list here).
  3. From the out-of-cluster client, call every pod and execute a modified `rpk debug bundle` in each redpanda container.
     a. The modified command will need to be able to receive the previously gathered k8s info.
     b. The modified command will need to return the bundle information in a way that the caller can read.
     c. To execute a command, we can use: https://github.com/kubernetes/client-go/tree/master/tools/remotecommand
  4. The out-of-cluster client will have to stitch together the bundles obtained from each pod.

This is a major refactor of the current way that the command works and will likely need an RFC first.

Please be aware that some of these changes are being made to overcome current limitations: