A good way to do this would be to parse oc get pod --all-namespaces -o json with jq to produce a reduced JSON summary (kudos to @gmeghnag for the idea).
I'm not sure whether jq is already included in the must-gather image, but it is included in the OCP4 repos, so it shouldn't be a problem.
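One quick way to check whether jq ships in the image would be to run it locally, roughly like this (the image path and tag here are assumptions, and pulling from registry.redhat.io needs a valid pull secret):
# Look for jq inside the must-gather image (image reference is an assumption)
podman run --rm --entrypoint /bin/sh registry.redhat.io/openshift4/ose-must-gather:latest -c 'command -v jq && jq --version'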
Something like the following (it still needs to be tested to check that it is valid):
$ oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq '[.items[]| {"name": .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]'
An example:
oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq '[.items[]| {"name": .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]'
[
  {
    "name": "openshift-apiserver-operator-85bc4dfdb4-zj6xn",
    "node": "ip-10-0-215-218.eu-central-1.compute.internal",
    "resources": {
      "requests": {
        "cpu": "10m",
        "memory": "50Mi"
      }
    }
  },
  {
    "name": "apiserver-6f8b7d589f-69kt4",
    "node": "ip-10-0-131-238.eu-central-1.compute.internal",
    "resources": {
      "requests": {
        "cpu": "100m",
        "memory": "200Mi"
      }
    }
  },
...
Or, if we want the same output filtered by node name, something like the following:
NODE=<NODE_NAME>
oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq --arg NODE "$NODE" '[.items[]| select(.spec.nodeName==$NODE) | {name: .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]'
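To produce that summary for every node, the node names can be taken from oc get nodes and fed into the same query, for example with a small loop like this (an untested sketch):
# Run the per-node summary for each node in the cluster
for NODE in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  oc get pods -A --field-selector="status.phase!=Succeeded" -o json \
    | jq --arg NODE "$NODE" '[.items[]| select(.spec.nodeName==$NODE) | {name: .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]'
done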
You would want to have the namespace in that output. It is perfectly possible to have more than one pod with the same name, especially if they come from StatefulSets or are created by some custom controller (or by hand).
For the rest, it looks fine.
I'd also suggest using the -c option of jq to produce compact output, and not wrapping the results inside an array. That way, one can use both jq and grep on the results (this is what the audit logs do, for reference).
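For example, something along these lines gives one JSON document per container that is both grep-able and jq-able (a rough sketch that also adds the namespace as suggested above; the grep pattern is just an illustration):
# Compact output: one JSON object per line, filterable with plain grep
oc get pods -A --field-selector="status.phase!=Succeeded" -o json \
  | jq -c '.items[]| {name: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName, resources: .spec.containers[].resources}' \
  | grep '"namespace":"openshift-apiserver"'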
I updated the query to also display the containerName:
oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq --arg NODE "$NODE" '.items[]| select(.spec.nodeName==$NODE) | . as $pod | .spec.containers[] | {node: $pod.spec.nodeName, namespace: $pod.metadata.namespace, podName: $pod.metadata.name, containerName: .name, resources: .resources}' -c
An example:
oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq --arg NODE "$NODE" '.items[]| select(.spec.nodeName==$NODE) | . as $pod | .spec.containers[] | {node: $pod.spec.nodeName, namespace: $pod.metadata.namespace, podName: $pod.metadata.name, containerName: .name, resources: .resources}' -c | head -5
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-apiserver","podName":"apiserver-6f8b7d589f-69kt4","containerName":"openshift-apiserver","resources":{"requests":{"cpu":"100m","memory":"200Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-apiserver","podName":"apiserver-6f8b7d589f-69kt4","containerName":"openshift-apiserver-check-endpoints","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-authentication","podName":"oauth-openshift-59795457bf-sbg4n","containerName":"oauth-openshift","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-cluster-csi-drivers","podName":"aws-ebs-csi-driver-controller-676777c46f-2cqn5","containerName":"csi-driver","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-cluster-csi-drivers","podName":"aws-ebs-csi-driver-controller-676777c46f-2cqn5","containerName":"driver-kube-rbac-proxy","resources":{"requests":{"cpu":"10m","memory":"20Mi"}}}
But in a must-gather, not all namespaces/pods are collected.
This is intentional: we are only focusing on control-plane related data, which is required to diagnose the cluster state and help our customers resolve the problem. Also, collecting any kind of data spanning all namespaces would risk exposing Personally Identifiable Information, which we would then be required to remove from the collected data set, and that isn't a trivial task to undertake. Lastly, every piece of data we scrape increases the overall size of the archive; in a cluster with a few nodes that isn't a big deal, but when you reach clusters with hundreds or thousands of nodes, the extra bytes make a significant difference. This forces us to justify any addition in terms of the balance between how much data we have to gather every time vs. what data we can request in follow-up engagements with our customers.
That information could help to identify overcommitted nodes, pods without requests/limits, etc.
That is a valid use case, but with the capabilities OpenShift currently has, that kind of information would be much better suited to being exposed in OpenShift Insights, which, based on cluster metrics, can suggest actions a user might take to improve the stability and availability of their cluster.
Based on the above, as well as other information presented in this issue, I'm closing this as won't fix.
/close
@soltysh: Closing this issue.
Collect in must-gather information similar to the information shown by oc describe nodes, like CPU/memory limits and requests per pod, allocated resources in the nodes, and real resource usage by pods (maybe also by containers). Currently, that kind of information is shown in an oc describe nodes output, but in a must-gather, not all namespaces/pods are collected.
That information could help to identify overcommitted nodes, pods without requests/limits, etc.
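For reference, the oc describe nodes sections being referred to are Non-terminated Pods and Allocated resources, which look roughly like the following (illustrative placeholder values, not taken from the reporter's cluster):
Non-terminated Pods:          (24 in total)
  Namespace              Name                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------              ----                        ------------  ----------  ---------------  -------------  ---
  openshift-apiserver    apiserver-6f8b7d589f-69kt4  110m (3%)     0 (0%)      250Mi (1%)       0 (0%)         2d
  ...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1530m (43%)   500m (14%)
  memory             4312Mi (29%)  512Mi (3%)
  ephemeral-storage  0 (0%)        0 (0%)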