[RFE] Include the output of oc describe nodes in the must-gather

oarribas commented 1 year ago

The oc describe nodes command provides information about the requests/limits, pods in the nodes, allocated resources and events related to each node. Having that information in the must-gather report could help a lot with troubleshooting. If possible, the description of each node in separated files with the name of the node.

sferich888 commented 1 year ago

[oc and kubectl] describe is a veneer that more or less pretty prints data from other objects (or api calls).

https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L201 https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L3581-L3735

Thus adding oc describe node[s] to must-gather isn't extending value (all it does is duplicate the data we 'save'; which increases archive sizes).

sferich888 commented 1 year ago

/close

openshift-ci[bot] commented 1 year ago

@sferich888: Closing this issue.

In response to [this](https://github.com/openshift/must-gather/issues/337#issuecomment-1330801009): >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

gmeghnag commented 1 year ago

[oc and kubectl] describe is a veneer that more or less pretty prints data from other objects (or api calls).

https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L201 https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L3581-L3735

Thus adding oc describe node[s] to must-gather isn't extending value (all it does is duplicate the data we 'save'; which increases archive sizes).

@sferich888 This is not true, adding the output of oc describe node <nodename> will add value as we will be able to check if some issues the customer is experiencing is related to the node being overcommitted:

$ oc describe $(oc get node -o name | head -1) | tail -11
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         735m (49%)    0 (0%)
  memory                      3094Mi (46%)  2016Mi (30%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

We cannot get the same information from the must-gather (it'll require to have requests and limits from all the pods in the cluster, but these info are not collected from the must-gather).

gmeghnag commented 1 year ago

/reopen

openshift-ci[bot] commented 1 year ago

@gmeghnag: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/openshift/must-gather/issues/337#issuecomment-1331406289): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

oarribas commented 1 year ago

/reopen

openshift-ci[bot] commented 1 year ago

@oarribas: Reopened this issue.

In response to [this](https://github.com/openshift/must-gather/issues/337#issuecomment-1332403614): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

oarribas commented 1 year ago

@sferich888 , a must-gather doesn't include the information of customer's pods as already said by @gmeghnag, only from "core" namespaces. With the output of oc describe nodes we can have useful information about the status of the nodes.

sferich888 commented 1 year ago

@oarribas / @gmeghnag what I am saying is that if you read files, in/from a must-gather; under: cluster-scoped-resources/core/nodes/*; you can/should be able to parse those files, you will be able to get at the node 'status'

status:
...
  allocatable:
    cpu: 9500m
    ephemeral-storage: "203489941959"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 31669688Ki
    pods: "250"
  capacity:
    cpu: "10"
    ephemeral-storage: 209179628Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32923064Ki
    pods: "250"

These values make up exactly what you are trying to represent (and or what oc describe presents). xref: https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L4171-L4231

/close

openshift-ci[bot] commented 1 year ago

@sferich888: Closing this issue.

In response to [this](https://github.com/openshift/must-gather/issues/337#issuecomment-1332790516): >@oarribas / @gmeghnag what I am saying is that if you read files, in/from a must-gather; under: `cluster-scoped-resources/core/nodes/*`; you can/should be able to parse those files, you will be able to get at the node 'status' > >``` >status: >... > allocatable: > cpu: 9500m > ephemeral-storage: "203489941959" > hugepages-1Gi: "0" > hugepages-2Mi: "0" > memory: 31669688Ki > pods: "250" > capacity: > cpu: "10" > ephemeral-storage: 209179628Ki > hugepages-1Gi: "0" > hugepages-2Mi: "0" > memory: 32923064Ki > pods: "250" >``` > >These values make up exactly what you are trying to represent (and or what `oc describe` presents). >xref: https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L4171-L4231 > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

gmeghnag commented 1 year ago

@sferich888 they are not the same values, the values you mentioned are the allocatable cpu and memory by the node; The following, instead are the allocated, basically the sum of the requests cpu/memory and limits cpu/memory for all the pods running on the node [1]:

$ oc describe $(oc get node -o name | head -1) | tail -11
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         845m (56%)    500m (33%)
  memory                      3244Mi (48%)  2516Mi (37%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

As an example as you can see the values for the memory are different:

$ oc describe $(oc get node -o name | head -1) | egrep "Capacity|memory:|Allocatable|Allocated|Resource|  memory"
Capacity:
  memory:                      7935220Ki
Allocatable:
  memory:                      6784244Ki
Allocated resources:
  Resource                    Requests      Limits
  memory                      3244Mi (48%)  2516Mi (37%)

I hope the difference is now clear.

[1] https://github.com/kubernetes/kubectl/blob/master/pkg/describe/describe.go#L4233-L4255

palonsoro commented 1 year ago

One typical policy/habit/call-it-as-you-wish of must-gather is to prefer machine-parsable over human-readable information (as it is more compact and can be made human-readable later) and avoid redundancy. At least, this is what I was told in the past (or I understood it that way).

So the right way to go here, if that information is desired, would be to get all the pods on the cluster and then have omc or insights or whatever parse the information in the way oc describe does.

The only problem I'd see with this approach, however, is that oc adm inspect should not be used for this, because that would gather all the pod logs and that would be horrible.

palonsoro commented 1 year ago

Another problem would be that we might be gathering too much information. Maybe must-gather can get the pod info, parse it to caculate the numbers and attach it.

But, it should still output some compact JSON or YAML file in a defined format, not a human-readable describe output subject to change between versions that may be a hell to parse by whoever wants to use that information.

oarribas commented 1 year ago

Opened #339 to collect the same information in a different way than oc describe nodes.

openshift / must-gather

[RFE] Include the output of oc describe nodes in the must-gather #337