rkruze closed this issue 1 year ago
I have also run into issues with `rpk debug bundle`, especially in a k8s deployment. Running the command with `kubectl exec` doesn't work:
```
$ k exec -it -n helm-test redpanda-0 -- rpk debug bundle
Defaulted container "redpanda" out of: redpanda, redpanda-configurator (init)
unable to create bundle: couldn't create bundle file: open 1654714425-bundle.zip: permission denied
command terminated with exit code 1
```
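The "permission denied" comes from rpk creating the zip in the current working directory, which the container user often can't write to. A minimal workaround sketch, assuming only standard POSIX shell: pick a writable directory first and run the bundle from there (the `rpk debug bundle` call at the end is commented out because it needs a running broker):

```shell
# The bundle zip is created in the working directory, so cd to a
# directory the current user can actually write to before running rpk.
for d in "$PWD" "$HOME" /tmp; do
  if [ -w "$d" ]; then
    cd "$d" && break
  fi
done
echo "bundle will be written under: $PWD"
# rpk debug bundle   # run from the writable directory
```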
And then running it from within the container also has issues:
```
$ rpk debug bundle
unable to create bundle: couldn't create bundle file: open 1654713986-bundle.zip: permission denied
$ cd ~
$ mkdir tmp
$ cd tmp
$ rpk debug bundle
9 errors occurred:
  * failed to get the size of the kernel log buffer: operation not permitted
  * exec: "dmidecode": executable file not found in $PATH
  * exec: "ss": executable file not found in $PATH
  * exec: "vmstat": executable file not found in $PATH
  * exec: "top": executable file not found in $PATH
  * exec: "dig": executable file not found in $PATH
  * exec: "ip": executable file not found in $PATH
  * exec: "lspci": executable file not found in $PATH
  * exec: "journalctl": executable file not found in $PATH
Debug bundle saved to '1654714014-bundle.zip'
$ unzip 1654714014-bundle.zip
sh: 11: unzip: not found
```
These executables are missing from the k8s container: `dmidecode`, `ss`, `vmstat`, `top`, `dig`, `ip`, `lspci`, `journalctl`. Because of these errors, the generated bundle isn't much use; for instance, the log file (and many other files) are empty. (Attached: 1654714014-bundle.zip)
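The missing-tool failures could be surfaced up front rather than one at a time. A small sketch (not how rpk actually does it) that probes the same list of executables before any of them is invoked:

```shell
# Probe for each external helper the bundler shells out to, and report
# the ones the container image doesn't ship.
checked=0
missing=""
for tool in dmidecode ss vmstat top dig ip lspci journalctl; do
  checked=$((checked + 1))
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
echo "checked $checked tools; missing:$missing"
```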
The main thing to decide here is whether we're going to do it over SSH or funnel all the data through Redpanda itself (i.e. proxy things like copying log files through internal RPC).
Retrieving logs won't always be possible: if we're writing out logs to stderr and something external is capturing them, there's nothing redpanda/rpk can do to fetch logs. This is the situation in all kubernetes cases.
Since we can't grab logs in kubernetes, this tool will mainly be for the non-containerized Linux server case, where I think we could reasonably expect/require that the user has SSH keys set up, and rpk can use those keys to go grab whatever it wants (no need to funnel log aggregation through redpanda).
If we're just rpk'ing over SSH, that also makes it much easier for rpk to drive all the other debug telemetry gathering (basically the list of tools in Josh's error output above): the tool evolves to "do what you do today, but at the other end of an SSH connection".
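The SSH model described above boils down to: run each collector command on the remote host and save its output into a per-host directory. A minimal illustration, with `ssh "$host" "$cmd"` stubbed out by a local shell so the loop can be shown without real SSH keys; the host name `node-1` and the command list are placeholders, not rpk behavior:

```shell
# run_remote stands in for `ssh "$host" "$cmd"`; locally it just runs
# the command so the collection loop can be demonstrated end to end.
run_remote() {
  sh -c "$2"
}

outdir=$(mktemp -d)
host=node-1
mkdir -p "$outdir/$host"
for cmd in uname date; do
  # each collector's output lands in <bundle>/<host>/<cmd>.txt
  run_remote "$host" "$cmd" > "$outdir/$host/$cmd.txt" 2>&1 || true
done
ls "$outdir/$host"
```

A real implementation would iterate over the broker list from the cluster config and zip `$outdir` at the end.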
@vuldin can you open a separate ticket about `rpk debug bundle` in containers? I think the high-level fix there may be to have it be a lot more polite about skipping the parts it can't run when it detects that it's in a container, or perhaps even to create a totally different "kubernetes-aware" mode that knows how to `kubectl exec` things and can also use kubectl to tell us all about the customer's CRD etc. Either way, I see that as separate from building the (very desirable) multi-node log grabber for non-container systems.
Could we persist the logs in the same directory as the data directory by default and thus Redpanda would have more control over the logs and be able to serve them up when needed via an API?
New ticket here https://github.com/redpanda-data/redpanda/issues/5081
> Could we persist the logs in the same directory as the data directory by default and thus Redpanda would have more control over the logs and be able to serve them up when needed via an API?
In the kubernetes case this is more of a "should we?" than a "could we?" -- it's pretty unexpected for a containerized application to do its own on-disk logging: the model for these systems is to have containers send logs somewhere central (and if you don't have a log aggregation platform, then kubernetes internally keeps a size-limited buffer of logs from each pod). `kubectl logs` exists to deal with people who haven't got as far as building a real logging system yet: if we build a kubernetes-aware debug bundler, it can use kubectl to pull recent logs.
Maybe the strongest reason not to do log collection via redpanda itself is that we would like it to work on nodes that have a bad problem, such that redpanda isn't running (or when a whole cluster is offline). I think it would be quite limiting to build a debug tool that only works when the cluster is somewhat healthy.
So what makes sense to me is:
Further out, when we add audit logging to redpanda, I want to add a similar "health log" that contains high-level cluster health history; that would be exposed via the admin API and collectable by rpk via that route.
@twmb is this still happening in Q3 (please say yes :))
This may be possible via the Kubernetes API, similar to what this PR does in the redpanda-configurator initContainer: https://github.com/redpanda-data/helm-charts/pull/224
`rpk debug bundle` as of v23.1 can capture logs from all nodes in a Kubernetes cluster. For self-hosted, we also improved it to capture admin API endpoint responses from all available hosts (rather than just the host the bundle is invoked on).
As a follow-up, we plan to automatically discover all hosts even if they are not in an existing configuration file, and we plan to capture on-host process outputs (top, lsof, etc.), but this requires some extra work from core. That said, this issue is basically done -- we'll track the extra enhancements in a separate ticket.
@twmb should we leave this open for the self-hosted scenario?
Additional requested features are being tracked in https://github.com/redpanda-data/redpanda/issues/10016
Who is this for, and what problem do they have today?
Currently, the `rpk debug zip` command takes debug and log information from a single node. However, this doesn't get the logs from all the nodes, which are needed to troubleshoot several kinds of issues.
What are the success criteria?
Have the ability to run `rpk debug info` against some endpoint, like the admin API, which can grab all the logs and metrics needed to debug a cluster issue.
Why is solving this problem impactful?
It helps us reduce the amount of time it takes to resolve an issue.