pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Need a tool for collecting TiDB operator related information for analysis #4686

Open hanlins opened 1 year ago

hanlins commented 1 year ago

Feature Request

Is your feature request related to a problem? Please describe:

For on-prem workloads, sometimes we need to collect information related to tidb-operator and hand it to the SRE team for further analysis. It would be nice to have a tool to automate the procedures.

Describe the feature you'd like:

It would be nice if tidb-operator could provide a binary that can help collect related information in the cluster with simple configurations.

Describe alternatives you've considered:

We can ask the SRE team to run kubectl manually to collect information, but it's error-prone and might need multiple rounds of communication to collect the information we need. We can also deliver a script to automate this, but it might be difficult to customize.

Teachability, Documentation, Adoption, Migration Strategy:

Given the binary and a kubeconfig for the cluster, users can simply run the binary to collect the info they need and export it, perhaps as a zip bundle. Users can use a config file to specify what to collect, e.g. which namespaces to collect information from.
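To make the idea concrete, here is a rough shell sketch of a config-driven collector. It is only an illustration, not an existing tidb-operator tool: the function name `collect_bundle`, the config format (one `<namespace> <resource[,resource...]>` pair per line), and the `KUBECTL` override variable are all assumptions made for this example, and it writes a tar.gz archive rather than a zip bundle.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a config-driven collector; not part of tidb-operator.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# collect_bundle <config-file> <bundle.tar.gz>
# Each non-empty, non-comment config line reads:
#   <namespace> <resource[,resource...]>
collect_bundle() {
  local config="$1" bundle="$2" workdir ns resources
  workdir="$(mktemp -d)" || return 1
  # The "|| [ -n "$ns" ]" part handles a last line without a trailing newline.
  while read -r ns resources || [ -n "$ns" ]; do
    # Skip blank lines, comments, and lines that name no resources.
    case "$ns" in ''|'#'*) continue ;; esac
    [ -n "$resources" ] || continue
    # One YAML file per namespace, limited to the listed resource kinds.
    "$KUBECTL" get "$resources" -n "$ns" -o yaml \
      > "$workdir/$ns.yaml" < /dev/null \
      || echo "failed to collect from namespace $ns" >&2
  done < "$config"
  # Pack everything into a single archive and clean up the scratch directory.
  tar -czf "$bundle" -C "$workdir" . && rm -rf "$workdir"
}
```

With a config file containing a line such as `tidb-cluster tidbclusters.pingcap.com,pods`, running `collect_bundle collect.conf bundle.tar.gz` would produce an archive with one YAML file per listed namespace. Because namespaces and resource kinds must be listed explicitly, nothing outside the config is collected.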

yiduoyunQ commented 1 year ago

Maybe using kubectl api-resources to collect all the (needed) config would work for this?

kubectl api-resources --verbs=list -n tidb-cluster -o name | grep -Ei "tidbclusters.pingcap.com|pods" | xargs -n 1 kubectl get -n tidb-cluster -o yaml

hanlins commented 1 year ago

Hey @yiduoyunQ, thanks for your insights! I think the command you shared covers most cases, and we can easily collect more resources by adding resource names to the grep pattern or by changing the namespace. My main concern is that the customers we deal with sometimes lack sufficient expertise in either k8s or tidb-operator, so they need us involved in collecting the data in their own environments before we can start the investigation. This takes quite some time, especially for customers in a different time zone.

Another concern is that the command could collect more information than we need. For instance, sometimes we want to collect information for TiDB clusters (together with their pods and asts) that carry certain labels in ns1 and ns2 but not ns3; we then need to craft a new command and run it for ns1 and ns2 separately, with label selectors. If people just replace -n tidb-cluster with -A (to collect resources in all namespaces), or forget the label selector, it could easily lead to data exfiltration.
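For the ns1/ns2 case, a scoped variant could look roughly like the sketch below. The function name `collect_scoped` and the `KUBECTL` override variable are made up for illustration; the only kubectl features relied on are `get` with a comma-separated resource list, `-n`, `-l`, and `-o yaml`. It refuses to run without a label selector and an explicit namespace list, which guards against the "forgot the selector" and "-A" mistakes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: collect selected resource kinds from an explicit list
# of namespaces, always filtered by a label selector.
KUBECTL="${KUBECTL:-kubectl}"

# collect_scoped <label-selector> <resource[,resource...]> <namespace>...
collect_scoped() {
  local selector="$1" resources="$2" ns
  # Require the selector, the resource list, and at least one namespace.
  if [ "$#" -lt 3 ] || [ -z "$selector" ] || [ -z "$resources" ]; then
    echo "usage: collect_scoped <selector> <resources> <namespace>..." >&2
    return 1
  fi
  shift 2
  # Run one scoped query per namespace; there is no all-namespaces mode.
  for ns in "$@"; do
    "$KUBECTL" get "$resources" -n "$ns" -l "$selector" -o yaml
  done
}
```

For example, `collect_scoped 'example.com/collect=true' tidbclusters.pingcap.com,pods ns1 ns2 > collected.yaml` would query only ns1 and ns2; the label `example.com/collect=true` is a placeholder for whatever labels the clusters actually carry.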

I think we need to document the steps for collecting the related data in our public documentation so our support engineers can point customers to it without back and forth. As for how to collect the information, I'm inclined to build a user-friendly tool for info collection. But until that is available and mature, we can share the command you mentioned above with our customers on our website. How about that?