Provide a way to get debug information from all nodes in a cluster

rkruze commented 2 years ago

Who is this for, and what problem do they have today?

Currently, the rpk debug zip command collects debug and log information from a single node only. Troubleshooting many issues requires logs from every node in the cluster, which this does not provide.

What are the success criteria?

The ability to run an rpk debug info command against some endpoint, such as the admin API, that grabs all the logs and metrics needed to debug a cluster-wide issue.

Why is solving this problem impactful?

It helps us reduce the amount of time it takes to resolve an issue.

vuldin commented 2 years ago

I have run into issues with rpk debug bundle as well, especially in a k8s deployment. Running the command with kubectl exec... doesn't work:

> k exec -it -n helm-test redpanda-0 -- rpk debug bundle
Defaulted container "redpanda" out of: redpanda, redpanda-configurator (init)
unable to create bundle: couldn't create bundle file: open 1654714425-bundle.zip: permission denied
command terminated with exit code 1

And then running it from within the container also has issues:

$ rpk debug bundle
unable to create bundle: couldn't create bundle file: open 1654713986-bundle.zip: permission denied
$ cd ~
$ mkdir tmp
$ cd tmp
$ rpk debug bundle
9 errors occurred:
        * failed to get the size of the kernel log buffer: operation not permitted
        * exec: "dmidecode": executable file not found in $PATH
        * exec: "ss": executable file not found in $PATH
        * exec: "vmstat": executable file not found in $PATH
        * exec: "top": executable file not found in $PATH
        * exec: "dig": executable file not found in $PATH
        * exec: "ip": executable file not found in $PATH
        * exec: "lspci": executable file not found in $PATH
        * exec: "journalctl": executable file not found in $PATH

Debug bundle saved to '1654714014-bundle.zip'
$ unzip 1654714014-bundle.zip
sh: 11: unzip: not found

Here are some issues with the command in the k8s container:

can't be used by the default user most of the time, since the default user doesn't have permission to write to disk (outside of ~)
depends on several other commands which are not available in the default Redpanda image (dmidecode, ss, vmstat, top, dig, ip, lspci, journalctl)
can't get the size of the kernel log buffer
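
The first two problems in this list can be detected before any collection starts. The following is a minimal sketch of such a preflight check, not how rpk actually behaves: the function name `preflight` and the fallback to the system temp directory are assumptions made for illustration, and the tool list is simply the one from the error output above.

```python
import os
import shutil
import tempfile

# Tools named in the error output above; treated here as optional collectors.
OPTIONAL_TOOLS = ["dmidecode", "ss", "vmstat", "top", "dig", "ip", "lspci", "journalctl"]


def preflight(output_dir, tools=OPTIONAL_TOOLS):
    """Report where a bundle could be written and which tools are on PATH.

    Hypothetical helper: the fallback directory is an assumption, not rpk behavior.
    """
    # Fall back to the system temp directory when output_dir is not writable.
    if not os.access(output_dir, os.W_OK):
        output_dir = tempfile.gettempdir()
    available = [t for t in tools if shutil.which(t) is not None]
    skipped = [t for t in tools if t not in available]
    return {"output_dir": output_dir, "available": available, "skipped": skipped}


if __name__ == "__main__":
    report = preflight(os.getcwd())
    print("writing bundle to:", report["output_dir"])
    print("skipping missing tools:", ", ".join(report["skipped"]) or "none")
```

A collector built this way could record the skipped tools in the bundle itself instead of reporting each one as an error.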

Because of these errors, the generated bundle isn't much use. For instance, the log file (and many other files) is empty. 1654714014-bundle.zip

jcsp commented 2 years ago

The main thing to define about this is whether we're going to do it over SSH or funnel all the data through Redpanda itself (i.e. proxy things like copying log files through internal rpc).

Retrieving logs won't always be possible: if we're writing out logs to stderr and something external is capturing them, there's nothing redpanda/rpk can do to fetch logs. This is the situation in all kubernetes cases.

Since we can't grab logs in kubernetes, this tool will be mainly for non-containerized Linux servers, where I think we could reasonably expect/require that the user has SSH keys set up, and rpk can use those keys to go grab whatever it wants (no need to funnel log aggregation through redpanda).

If we're just rpk'ing over SSH, that also makes it much easier for rpk to drive all the other debug telemetry gathering (basically the list of tools in Josh's error output above): the tool evolves to basically "do what you do today, but at the other end of an SSH connection".
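
The "same thing, over SSH" idea can be sketched as a small fan-out helper. This is an illustration under stated assumptions, not rpk's design: the names `ssh_command` and `run_on_all` are hypothetical, key-based authentication is assumed to be set up already, and the host names in the usage note are placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_command(host, remote_cmd, user=None):
    """Build the argv for running remote_cmd on host over SSH."""
    target = f"{user}@{host}" if user else host
    # BatchMode=yes makes ssh fail instead of prompting when key auth is not set up.
    return ["ssh", "-o", "BatchMode=yes", target, remote_cmd]


def run_on_all(hosts, remote_cmd, runner=None, user=None):
    """Run remote_cmd on every host in parallel; return {host: (exit_code, output)}."""

    def default_runner(argv):
        proc = subprocess.run(argv, capture_output=True, text=True)
        return proc.returncode, proc.stdout + proc.stderr

    runner = runner or default_runner
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        results = pool.map(lambda h: runner(ssh_command(h, remote_cmd, user)), hosts)
        return dict(zip(hosts, results))
```

For example, `run_on_all(["node-0.example", "node-1.example"], "rpk debug bundle")` would attempt the existing single-node command on each host. Copying the resulting zip files back is left out of the sketch.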

jcsp commented 2 years ago

@vuldin can you open a separate ticket about rpk debug bundle in containers? I think the high-level thing there may be to just have it be a lot more polite about not running the parts it can't do in a container, when it detects that it's in that environment, or perhaps even create a totally different "kubernetes aware" mode that knows how to go kubectl exec things and can also use kubectl to tell us all about the customer's CRD etc. But either way, I see that as a separate thing to making the (very desirable) multi-node log grabber for non-container systems.

rkruze commented 2 years ago

Could we persist the logs in the same directory as the data directory by default and thus Redpanda would have more control over the logs and be able to serve them up when needed via an API?

vuldin commented 2 years ago

New ticket here https://github.com/redpanda-data/redpanda/issues/5081

jcsp commented 2 years ago

> Could we persist the logs in the same directory as the data directory by default and thus Redpanda would have more control over the logs and be able to serve them up when needed via an API?

In the kubernetes case this is more of a "should we?" than a "could we" -- it's pretty unexpected for a containerized application to do its own on-disk logging: the model for these systems is to have containers send logs somewhere central (and if you don't have a log aggregation platform, then kubernetes internally keeps a size-limited buffer of logs from each pod).

kubectl logs exists to deal with people who haven't got as far as building a real logging system yet: if we build a kubernetes-aware debug bundler, then it can go use kubectl to pull recent logs.
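
As a rough illustration of what "use kubectl to pull recent logs" could look like, the sketch below only builds the command line. The helper name `kubectl_logs_command` is hypothetical. The `-n`, `-c`, and `--tail` options are standard `kubectl logs` flags, and the namespace, pod, and container names in the usage note come from the transcript earlier in this thread.

```python
def kubectl_logs_command(namespace, pod, container=None, tail=None):
    """Build the argv for fetching a pod's logs with kubectl."""
    argv = ["kubectl", "logs", "-n", namespace, pod]
    if container:
        # Select one container when the pod has several (e.g. an init container).
        argv += ["-c", container]
    if tail is not None:
        # Limit output to the most recent `tail` lines.
        argv.append(f"--tail={tail}")
    return argv
```

Calling `kubectl_logs_command("helm-test", "redpanda-0", container="redpanda", tail=1000)` yields a command that could be passed to `subprocess.run` once per pod.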

Maybe the strongest reason not to do log collection via redpanda itself is that we would like it to work on nodes that have a bad problem, such that redpanda isn't running (or when a whole cluster is offline). I think it would be quite limiting to build a debug tool that only works when the cluster is somewhat healthy.

So what makes sense to me is:

Kubernetes: rpk fetches pod logs from the Kubernetes API, and also does other Kubernetes-specific things like getting the cluster CRD.
Plain Linux: rpk fetches logs via SSH, and also does other Linux-specific things like calling low-level tools (dmidecode etc.).
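
The two-way split above could be expressed as a small dispatch step. This is a sketch under assumptions: it only detects Kubernetes when the tool itself runs inside a pod, relying on the `KUBERNETES_SERVICE_HOST` variable that Kubernetes normally injects into pod environments, and the names `in_kubernetes` and `collection_plan` as well as the plan labels are hypothetical.

```python
import os


def in_kubernetes(environ=None):
    """Best-effort check for running inside a Kubernetes pod."""
    environ = os.environ if environ is None else environ
    # Kubernetes normally sets this variable in every pod's environment.
    return "KUBERNETES_SERVICE_HOST" in environ


def collection_plan(environ=None):
    """Pick a log source and extra collectors for the detected environment."""
    if in_kubernetes(environ):
        return {"logs": "kubernetes-api", "extras": ["cluster CRD"]}
    return {"logs": "ssh", "extras": ["dmidecode"]}
```

A tool run from an operator's workstation would not see that variable, so an explicit mode selection would still be needed there.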

Further out, when we add audit logging to redpanda, I want to add a similar "health log" that contains high-level cluster health history, and that would be exposed via admin API + collectable by rpk via that route.

mattschumpert commented 2 years ago

@twmb is this still happening in Q3 (please say yes :))

vuldin commented 1 year ago

This may be possible via the Kubernetes API, similar to what this PR does in the redpanda-configurator initContainer: https://github.com/redpanda-data/helm-charts/pull/224

twmb commented 1 year ago

rpk debug bundle as of v23.1 can capture logs from all nodes in a Kubernetes cluster. For self-hosted, we also improved it to capture admin API endpoint responses from all hosts that are available (rather than just the host the bundle is being called on).
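
To make "responses from all hosts that are available" concrete, here is a minimal sketch of the general pattern, not rpk's implementation. The function names are hypothetical, hosts are assumed to be given as `host:port` strings reachable over plain HTTP, and the endpoint paths are left to the caller. An unreachable host is recorded as an error string and does not abort the run.

```python
import http.client
import json
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect_admin_responses(hosts, paths, fetch=fetch_json):
    """Return {host: {path: decoded response, or an 'error: ...' string}}."""
    results = {}
    for host in hosts:
        per_host = {}
        for path in paths:
            try:
                per_host[path] = fetch(f"http://{host}{path}")
            except (OSError, ValueError, http.client.HTTPException) as err:
                # Unreachable host or undecodable body: record it and keep going.
                per_host[path] = f"error: {err}"
        results[host] = per_host
    return results
```

A call might look like `collect_admin_responses(["node-0.example:9644"], ["/v1/brokers"])`, where the host name is a placeholder and the path is only an illustrative choice of endpoint.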

As a follow-up, we plan to automatically discover all hosts even if they are not in an existing configuration file, and we plan to capture on-host process outputs (top, lsof, etc.), but this requires some extra work from core. That said, this issue is basically done -- we’ll track the extra enhancements in a separate ticket.

rkruze commented 1 year ago

@twmb should we leave this open for the self-hosted scenario?

twmb commented 1 year ago

Additional requested features are being tracked in https://github.com/redpanda-data/redpanda/issues/10016
