vlours / ocadmtop

Script to collect the pods' resource consumption and display it by node.

timeout #1

Closed kotarusv closed 5 months ago

kotarusv commented 7 months ago

Hello

I found the script helpful for troubleshooting. However, the script times out when running on a busy or large cluster. Can we add an option to set the timeout or increase the default timeout?

I tested on an OpenShift 4.12 cluster. Our clusters are large, with hundreds of nodes and many pods on each node.

Below is the error, even for a single node:

    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)

Srinivas Kotaru

vlours commented 7 months ago

Hi @kotarusv Thanks for your interest in this script, and your feedback.

Could you please provide more details about the timeout? I'm guessing that the script is timing out on the collection of the metrics (line 66).

So, just to review the issue, can you please provide the following information?

Otherwise, adding a timeout is totally possible as the oc command has a customizable request-timeout option:

    --request-timeout='0':
    The length of time to wait before giving up on a single server request. Non-zero values should contain a
    corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
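For example, the following invocation would allow up to 5 minutes for the metrics request instead of the default 60 seconds (the value is only illustrative):

```bash
# Illustrative only: give the API server up to 5 minutes to answer
oc adm top pod -A --request-timeout='5m'
```
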
kotarusv commented 7 months ago

Sorry for the late reply. Here is the information you asked for:

1. Single-node run:

    time ocadmtop_node.sh -H <node_name>
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)

    ===== ===== CPU MEM namespace/pod

    real 1m3.686s
    user 0m0.338s
    sys  0m0.104s

2. Node and pod counts in the cluster:

    kubectl get nodes | wc -l
    248

    kubectl get pods -A | wc -l
    17226

3. Debug output:

    bash -x ocadmtop_node.sh -H <node_name> -d 6
    + VERSION=0.3
    + getopts cd:mpL:H:Avh option
    + case ${option} in
    ++ echo np-rtp-01-wt475-dedicated0-int-router-1a-c46kv
    ++ sed -e 's!,! !'
    + NODES=np-rtp-01-wt475-dedicated0-int-router-1a-c46kv
    + getopts cd:mpL:H:Avh option
    + case ${option} in
    + [[ 6 =~ ^[1-9]$ ]]
    + LOGLEVEL='--loglevel 6'
    + getopts cd:mpL:H:Avh option
    + [[ -z np-rtp-01-wt475-dedicated0-int-router-1a-c46kv ]]
    ++ oc adm top pod -A --loglevel 6
    I0223 17:03:48.638211  192860 loader.go:373] Config loaded from file:  /home/ubuntu/.kube/config
    I0223 17:03:48.810602  192860 round_trippers.go:553] GET https://api.example.com:6443/api 200 OK in 170 milliseconds
    I0223 17:03:48.860584  192860 round_trippers.go:553] GET https://api.example.com:6443/apis 200 OK in 49 milliseconds
    I0223 17:04:48.911629  192860 round_trippers.go:553] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods 504 Gateway Timeout in 60050 milliseconds
    I0223 17:04:48.912299  192860 helpers.go:246] server response object: [{
    "metadata": {},
    "status": "Failure",
    "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)",
    "reason": "Timeout",
    "details": {
    "group": "metrics.k8s.io",
    "kind": "pods",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within the allotted timeout\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}"
      }
    ]
    },
    "code": 504
    }]
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
    + RESOURCES=
    [...]

It is due to the server being unable to return a response in time:

  "metadata": {},
  "status": "Failure",
  "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)",
  "reason": "Timeout",
  "details": {
    "group": "metrics.k8s.io",
    "kind": "pods",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within the allotted timeout\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}"
      }
    ]
  },
  "code": 504
}]
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
vlours commented 7 months ago

Hi @kotarusv,

Thanks for sharing the outputs. As I suspected, oc adm top pod is the command that is timing out after 60 seconds.

I0223 17:04:48.911629 192860 round_trippers.go:553] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods 504 Gateway Timeout in 60050 milliseconds

I've implemented the timeout option in version 0.4 of the script, applying to all oc commands via the -t option. Please download the latest version.

As documented in the oc command help, the option works as follows:

    The length of time to wait before giving up on a single server request. Non-zero values should contain a
    corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.

The default value remains 1 minute, but you should now be able to adjust it.
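For reference, here is a minimal sketch of how a -t option can be threaded through to the oc calls; the variable and option handling below are illustrative, not the script's actual code:

```bash
#!/bin/bash
# Minimal sketch (illustrative, not the actual ocadmtop_node.sh code):
# forward a -t value to every oc call via --request-timeout.
TIMEOUT_OPT=""
while getopts "t:H:" option; do
  case ${option} in
    t) TIMEOUT_OPT="--request-timeout=${OPTARG}" ;;   # e.g. 1s, 2m, 3h, or 0 for no timeout
    H) NODES=$(echo "${OPTARG}" | sed -e 's!,! !g') ;;
  esac
done
# Every subsequent call inherits the timeout:
oc adm top pod -A ${TIMEOUT_OPT}
```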

I hope this solves your issue. Please let me know how it goes; otherwise, I will close this issue next week.

Thanks. Vincent

kotarusv commented 6 months ago

Thanks for looking into this, and sorry for the late reply. I am still seeing the issue:

    ocadmtop_node.sh -t 2m -H
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)

    ===== ===== CPU MEM namespace/pod

    ocadmtop_node.sh -t 0 -H
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)

    ===== ===== CPU MEM namespace/pod

    ocadmtop_node.sh -H
    Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

    ===== ===== CPU MEM namespace/pod

vlours commented 6 months ago

Hi @kotarusv

Sorry to hear this! Can you please confirm that you are running the CLI matching your cluster version?

Can you please share the output of the command:

time oc adm top pod -A --request-timeout='0' --loglevel 8

Cheers, Vincent

kotarusv commented 6 months ago

    time oc adm top pod -A --request-timeout='0' --loglevel 8
    I0309 17:24:51.347172 352019 loader.go:373] Config loaded from file: /home/ubuntu/.kube/config
    I0309 17:24:51.350775 352019 round_trippers.go:463] GET https://api.example.com:6443/api
    I0309 17:24:51.350894 352019 round_trippers.go:469] Request Headers:
    I0309 17:24:51.350966 352019 round_trippers.go:473] Authorization: Bearer
    I0309 17:24:51.351038 352019 round_trippers.go:473] Accept: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList,application/json
    I0309 17:24:51.351103 352019 round_trippers.go:473] User-Agent: oc/4.14.0 (linux/amd64) kubernetes/9b1e0d2
    I0309 17:24:51.545020 352019 round_trippers.go:574] Response Status: 200 OK in 193 milliseconds
    I0309 17:24:51.545061 352019 round_trippers.go:577] Response Headers:
    I0309 17:24:51.545071 352019 round_trippers.go:580] X-Kubernetes-Pf-Prioritylevel-Uid: b26aafec-7b2e-4f2f-9bbd-c49b6de791bb
    I0309 17:24:51.545083 352019 round_trippers.go:580] Content-Length: 137
    I0309 17:24:51.545094 352019 round_trippers.go:580] Date: Sat, 09 Mar 2024 17:25:40 GMT
    I0309 17:24:51.545105 352019 round_trippers.go:580] Audit-Id: cd8eea90-31a6-49b6-b80f-e8bfed88ff16
    I0309 17:24:51.545114 352019 round_trippers.go:580] Cache-Control: no-cache, private
    I0309 17:24:51.545123 352019 round_trippers.go:580] Content-Type: application/json
    I0309 17:24:51.545133 352019 round_trippers.go:580] Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
    I0309 17:24:51.545142 352019 round_trippers.go:580] X-Kubernetes-Pf-Flowschema-Uid: 1cdda95a-00e8-4a2a-860f-994bf18fb17e
    I0309 17:24:51.545181 352019 request.go:1188] Response Body: {"kind":"APIVersions","versions":["v1"],"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"XX.XX.XX.XX:6443"}]}
    I0309 17:24:51.545359 352019 round_trippers.go:463] GET https://api.example.com:6443/apis
    I0309 17:24:51.545371 352019 round_trippers.go:469] Request Headers:
    I0309 17:24:51.545381 352019 round_trippers.go:473] Accept: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList,application/json
    I0309 17:24:51.545393 352019 round_trippers.go:473] User-Agent: oc/4.14.0 (linux/amd64) kubernetes/9b1e0d2
    I0309 17:24:51.545408 352019 round_trippers.go:473] Authorization: Bearer
    I0309 17:24:51.595687 352019 round_trippers.go:574] Response Status: 200 OK in 50 milliseconds
    I0309 17:24:51.595713 352019 round_trippers.go:577] Response Headers:
    I0309 17:24:51.595749 352019 round_trippers.go:580] Audit-Id: c4356f1d-472c-48c3-b25a-bd7b6f69e87f
    I0309 17:24:51.595762 352019 round_trippers.go:580] Cache-Control: no-cache, private
    I0309 17:24:51.595773 352019 round_trippers.go:580] Content-Type: application/json
    I0309 17:24:51.595784 352019 round_trippers.go:580] Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
    I0309 17:24:51.595795 352019 round_trippers.go:580] X-Kubernetes-Pf-Flowschema-Uid: 1cdda95a-00e8-4a2a-860f-994bf18fb17e
    I0309 17:24:51.595824 352019 round_trippers.go:580] X-Kubernetes-Pf-Prioritylevel-Uid: b26aafec-7b2e-4f2f-9bbd-c49b6de791bb
    I0309 17:24:51.595836 352019 round_trippers.go:580] Date: Sat, 09 Mar 2024 17:25:40 GMT
    I0309 17:24:51.596127 352019 request.go:1188] Response Body: {"kind":"APIGroupList","apiVersion":"v1","groups":[{"name":"apiregistration.k8s.io","versions":[{"groupVersion":"apiregistration.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"apiregistration.k8s.io/v1","version":"v1"}},{"name":"apps","versions":[{"groupVersion":"apps/v1","version":"v1"}],"preferredVersion":{"groupVersion":"apps/v1","version":"v1"}},{"name":"events.k8s.io","versions":[{"groupVersion":"events.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"events.k8s.io/v1","version":"v1"}},{"name":"authentication.k8s.io","versions":[{"groupVersion":"authentication.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"authentication.k8s.io/v1","version":"v1"}},{"name":"authorization.k8s.io","versions":[{"groupVersion":"authorization.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"authorization.k8s.io/v1","version":"v1"}},{"name":"autoscaling","versions":[{"groupVersion":"autoscaling/v2","version":"v2"},{"groupVersion":"autoscaling/v1","version":"v1"},{ [truncated 19314 chars]
    I0309 17:24:51.596944 352019 round_trippers.go:463] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods
    I0309 17:24:51.596962 352019 round_trippers.go:469] Request Headers:
    I0309 17:24:51.596970 352019 round_trippers.go:473] Accept: application/vnd.kubernetes.protobuf, */*
    I0309 17:24:51.596996 352019 round_trippers.go:473] User-Agent: oc/4.14.0 (linux/amd64) kubernetes/9b1e0d2
    I0309 17:24:51.597009 352019 round_trippers.go:473] Authorization: Bearer
    I0309 17:25:51.647159 352019 round_trippers.go:574] Response Status: 504 Gateway Timeout in 60050 milliseconds
    I0309 17:25:51.647209 352019 round_trippers.go:577] Response Headers:
    I0309 17:25:51.647226 352019 round_trippers.go:580] Content-Type: text/plain; charset=utf-8
    I0309 17:25:51.647244 352019 round_trippers.go:580] Content-Length: 152
    I0309 17:25:51.647261 352019 round_trippers.go:580] Date: Sat, 09 Mar 2024 17:26:40 GMT
    I0309 17:25:51.647276 352019 round_trippers.go:580] Audit-Id: 16a2b6ab-8e2b-4cce-b571-62f37370a45f
    I0309 17:25:51.647292 352019 round_trippers.go:580] Cache-Control: no-cache, private
    I0309 17:25:51.647307 352019 round_trippers.go:580] Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
    I0309 17:25:51.647345 352019 request.go:1188] Response Body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
    I0309 17:25:51.647681 352019 helpers.go:246] server response object: [{ "metadata": {}, "status": "Failure", "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)", "reason": "Timeout", "details": { "group": "metrics.k8s.io", "kind": "pods", "causes": [ { "reason": "UnexpectedServerResponse", "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within the allotted timeout\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}" } ] }, "code": 504 }]
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)

    real 1m0.411s
    user 0m0.122s
    sys  0m0.037s

vlours commented 6 months ago

OK, so it seems that the timeout was ignored. Can you run:

    time oc adm top pod -A --request-timeout='15m' --loglevel 8 2>&1 | grep -E 'GET|Response Status' | sed -e 's#//[-a-Z.0-9]*:6443#//api.example.com:6443#'

to check whether it is also ignoring a 15-minute timeout setting, or whether the request is taking more than 15 minutes.

vlours commented 5 months ago

Hi @kotarusv

Have you reviewed my last comment? Can you please provide an update on the request and let me know if you still need assistance with this issue?

Cheers,

skotaru-cisco commented 5 months ago

I apologize for the delay; I was away last week. Here is the reply you requested. On small clusters I receive the complete output (pod CPU and memory). However, on the large cluster I do not receive any output except the response provided below.

    time oc adm top pod -A --request-timeout='15m' --loglevel 8 2>&1 | grep -E 'GET|Response Status'
    I0408 03:23:17.071138 600193 round_trippers.go:463] GET https://api.com:6443/api?timeout=15m0s
    I0408 03:23:17.245076 600193 round_trippers.go:574] Response Status: 200 OK in 173 milliseconds
    I0408 03:23:17.245340 600193 round_trippers.go:463] GET https://api.com:6443/apis?timeout=15m0s
    I0408 03:23:17.297129 600193 round_trippers.go:574] Response Status: 200 OK in 51 milliseconds
    I0408 03:23:17.298310 600193 round_trippers.go:463] GET https://api.com:6443/apis/metrics.k8s.io/v1beta1/pods
    I0408 03:23:57.879896 600193 round_trippers.go:574] Response Status: 200 OK in 40581 milliseconds

    real 0m41.425s
    user 0m0.490s
    sys  0m0.337s

vlours commented 5 months ago

Thanks a lot for that output. Actually, I just realized that the metrics query doesn't have the timeout option set, meaning that this query uses the default 60-second timeout. I will raise an issue for this bug and provide you with an update once it's fixed.

Cheers,

vlours commented 5 months ago

Hey @kotarusv (@skotaru-cisco)

I'm working with the engineering team to determine where the issue is. I will keep you updated ASAP.

In the meantime, I'm working on a workaround. So can you please try these commands on your largest cluster?

    $ time oc get --request-timeout=15m --raw "/apis/metrics.k8s.io/v1beta1/pods" --loglevel 8 2>&1 | grep -E 'GET|Response Status'
    $ time oc get --request-timeout=15m --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq -r '.items | length'

The first command should bypass the first two API calls and focus only on the metrics, with the right timeout (15 min). The second command will simply report the number of pods with metrics running in your cluster.

Cheers,

kotarusv commented 5 months ago

@vlours Thanks for looking. Here is the requested info. Feel free to ask me if any additional info is required.

Thanks @vlours

    time oc get --request-timeout=15m --raw "/apis/metrics.k8s.io/v1beta1/pods" --loglevel 8 2>&1 | grep -E 'GET|Response Status'
    I0416 15:20:35.029564 706443 round_trippers.go:463] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods?timeout=15m0s
    I0416 15:21:28.071438 706443 round_trippers.go:574] Response Status: 200 OK in 53041 milliseconds

    real 0m53.311s
    user 0m0.173s
    sys  0m0.029s

    time oc get --raw '/apis/metrics.k8s.io/v1beta1/pods' --request-timeout=15m | jq -r '.items | length'
    10348

    real 0m37.744s
    user 0m0.304s
    sys  0m0.093s

vlours commented 5 months ago

@kotarusv

Thanks for the outputs. It looks like you can retrieve the raw data in less than 60 seconds, which suggests that oc adm top pod may sometimes work, since you are very close to the default timeout.

I've changed the script to work with the raw data instead of the oc adm top pod command, allowing the timeout to be set correctly. So you should be good to go with version 0.5 of the script, which is merged into the main branch.

As mentioned, based on the previous output, setting the timeout option to 2 minutes should be enough in your case. But feel free to adjust it.
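For the curious, here is a rough sketch of how the raw metrics can be turned into a per-pod listing; this is an approximation of the idea, not the script's exact implementation, and the usage values come back in raw units (e.g. n for CPU, Ki for memory):

```bash
# Rough sketch (not the script's exact code): list per-pod CPU/memory
# from the raw metrics endpoint, with a 2-minute request timeout.
oc get --raw "/apis/metrics.k8s.io/v1beta1/pods" --request-timeout=2m \
  | jq -r '.items[]
      | "\(.metadata.namespace)/\(.metadata.name) " +
        ([.containers[].usage.cpu]    | join("+")) + " " +
        ([.containers[].usage.memory] | join("+"))'
```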

Let me know how you are going with this new version.

Waiting to hear from you.

cheers,

skotaru-cisco commented 5 months ago

Thank you so much for relentlessly working on this issue. It looks like it's working this time. I didn't provide any timeout option. I am able to see the output both by default (the all-nodes listing) and for a single node using the -H option.

Is the -H option taking more time because it is looping through all the nodes to find the given node?

    time ocadmtop_node.sh -H np-rtp-01-wt475-master-3

    ===== np-rtp-01-wt475-master-1 =====

    real 0m51.444s
    user 0m2.108s
    sys  0m0.432s

vlours commented 5 months ago

The -H option should be quicker, as it only retrieves the oc describe node output for the desired nodes.

But overall, the script should take almost the same time to run with any option, as in your case the latency comes from the metrics collection, which takes around 50 seconds. If you run the script twice in a row, the second run may be quicker, as some data will already be cached on your local server.

I hope this answers your question.

skotaru-cisco commented 5 months ago

Thanks. I'm good now. It is working as expected.

I could see the version command giving the errors below. Please check.

    ocadmtop_node.sh -h
    ocadmtop_node.sh [-c|-m|-p] [-A|-L ,,...|-H ,,...] [-d {0-10}] [-t ][-v|-h]
      -c: sort by CPU (default)
      -m: sort by Memory
      -p: sort by namespace/pod
      -L: retrieve node(s) matching all labels
      -H: retrieve node(s) by hostname
      -A: retrieve All nodes (default)
      -d: debug/loglevel mode. Provide additional 'oc --loglevel' ouput. (Recommended value: 6)
      -t: The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
      -v: Display the version
      -h: Display this help
    /usr/local/bin/ocadmtop_node.sh: line 27: ${STD_ERR}: ambiguous redirect
    /usr/local/bin/ocadmtop_node.sh: line 28: ${STD_ERR}: ambiguous redirect
    ocadmtop_node.sh - Version: N/A

I would suggest one improvement if possible: add an -n flag to pass a namespace and sort by memory and CPU. We can still use oc adm top pod -n <namespace>, but then this script could cover all the functions.

Srinivas Kotaru

vlours commented 5 months ago

Release v0.6 includes the -n option to filter on a namespace, as part of RFE #6, plus a performance improvement: it now collects the pod data from oc get pod -A -o wide instead of running multiple oc describe node commands. The namespace filtering should also improve performance, as it applies to both the metric and pod lists.
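As an aside, a single pod listing can provide the pod-to-node mapping in one API call; here is a rough sketch of the idea (not the script's exact command, and <node_name> is a placeholder):

```bash
# Rough sketch: one API call returns the namespace/pod/node mapping,
# instead of running `oc describe node` once per node.
oc get pod -A --request-timeout=2m \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName' \
  | awk -v node="<node_name>" '$3 == node {print $1 "/" $2}'
```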

Please download the latest version and let me know if it is working as expected. If you have any other needs, feel free to raise another issue (as I did for your namespace request in #6).

Cheers,

skotaru-cisco commented 5 months ago

    ocadmtop_node.sh -H cae-ai-rcdn-dev-22vtl-worker-0-hxv9k

    ===== cae-ai-rcdn-dev-22vtl-worker-0-hxv9k =====
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name
    awk: fatal: cannot use gawk builtin `namespace' as variable name

I am using an Ubuntu-based workstation to run my commands.

Srinivas Kotaru

vlours commented 5 months ago

Thanks @kotarusv for the feedback. 🙇 And sorry for the inconvenience. This error did not occur in my test environment.

I've patched the script to use alternate variable names in PR #8. They should no longer conflict with the gawk built-in on Ubuntu. Please download the latest version: 0.6.1.
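For context, the clash looks roughly like this on gawk 5+ (Ubuntu's default), where `namespace` is a reserved built-in; renaming the awk variable is enough to avoid it (the example below is illustrative, not the script's code):

```bash
# Fails on gawk 5+: 'namespace' is a reserved built-in identifier
#   awk -v namespace="openshift-monitoring" '$1 == namespace {print}'
# Works: use a different variable name
awk -v ns="openshift-monitoring" '$1 == ns {print}' <<< "openshift-monitoring pod-1"
```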

skotaru-cisco commented 5 months ago

Perfect. It is working as expected. I tested all nodes, a single node, and a namespace. I like the namespace output, as it shows which nodes that namespace's pods are running on. This is probably useful for client namespaces but too verbose for an infra namespace that has daemonset pods running. In those situations, I still prefer the oc adm top pod -n openshift-monitoring --sort-by=cpu/memory output.

Thank you very much for patiently fixing all the issues reported. I really appreciate it.

vlours commented 5 months ago

Thanks for your feedback! 🙇 I'm closing this issue for now. Feel free to raise a new issue/RFE if you are facing a new issue or are looking for new functionality.

I've already planned to add a new feature to the script to display the stats at the container level (RFE #4).

And if you like this script and want to support it, please click the star button on the repository page: https://github.com/vlours/ocadmtop