Closed kotarusv closed 5 months ago
Hi @kotarusv Thanks for your interest in this script, and your feedback.
Could you please provide more details about the timeout? I'm guessing that the script is timing out on the collection of the metrics (line 66).
So, to review the issue, can you please provide the following information?
bash -x ocadmtop_node.sh -d 6
This will run the command in debug mode with additional bash trace output; the last 20 lines will hopefully contain the last API call and the timeout error.
Otherwise, adding a timeout is totally possible, as the oc command has a customizable request-timeout option:
--request-timeout='0':
The length of time to wait before giving up on a single server request. Non-zero values should contain a
corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
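A minimal sketch of how such an option could be wired into a wrapper script (the -t flag and the TIMEOUT variable name here are illustrative, not necessarily what ocadmtop_node.sh actually uses):

```shell
#!/bin/bash
# Sketch: accept a -t option and forward it to every oc call via --request-timeout.
TIMEOUT="1m"   # default, matching the 60s server-side limit seen in the error
while getopts "t:" option; do
  case ${option} in
    t) TIMEOUT=${OPTARG} ;;
  esac
done
# Each oc call in the script would then carry the option, e.g.:
echo oc adm top pod -A --request-timeout="${TIMEOUT}"
```

Running the script as `ocadmtop_node.sh -t 2m` would then apply the 2-minute timeout to every API request.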
Sorry for the late reply. This is the info you asked for.
time ocadmtop_node.sh -H <node_name>
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
=====
real	1m3.686s
user	0m0.338s
sys	0m0.104s
2. Node count in the cluster
```bash
kubectl get nodes | wc -l
248
kubectl get pods -A | wc -l
17226
```
bash -x ocadmtop_node.sh -H <node_name> -d 6
+ VERSION=0.3
+ getopts cd:mpL:H:Avh option
+ case ${option} in
++ echo np-rtp-01-wt475-dedicated0-int-router-1a-c46kv
++ sed -e 's!,! !'
+ NODES=np-rtp-01-wt475-dedicated0-int-router-1a-c46kv
+ getopts cd:mpL:H:Avh option
+ case ${option} in
+ [[ 6 =~ ^[1-9]$ ]]
+ LOGLEVEL='--loglevel 6'
+ getopts cd:mpL:H:Avh option
+ [[ -z np-rtp-01-wt475-dedicated0-int-router-1a-c46kv ]]
++ oc adm top pod -A --loglevel 6
I0223 17:03:48.638211 192860 loader.go:373] Config loaded from file: /home/ubuntu/.kube/config
I0223 17:03:48.810602 192860 round_trippers.go:553] GET https://api.example.com:6443/api 200 OK in 170 milliseconds
I0223 17:03:48.860584 192860 round_trippers.go:553] GET https://api.example.com:6443/apis 200 OK in 49 milliseconds
I0223 17:04:48.911629 192860 round_trippers.go:553] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods 504 Gateway Timeout in 60050 milliseconds
I0223 17:04:48.912299 192860 helpers.go:246] server response object: [{
"metadata": {},
"status": "Failure",
"message": "the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)",
"reason": "Timeout",
"details": {
"group": "metrics.k8s.io",
"kind": "pods",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within the allotted timeout\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}"
}
]
},
"code": 504
}]
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
+ RESOURCES=
[...]
It is due to the server being unable to return a response in time.
"metadata": {},
"status": "Failure",
"message": "the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)",
"reason": "Timeout",
"details": {
"group": "metrics.k8s.io",
"kind": "pods",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within the allotted timeout\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}"
}
]
},
"code": 504
}]
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
Hi @kotarusv,
Thanks for sharing the outputs.
As I was guessing, the oc adm top pod command is the one timing out after 60 seconds:
I0223 17:04:48.911629 192860 round_trippers.go:553] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods 504 Gateway Timeout in 60050 milliseconds
I've implemented the timeout option in version 0.4 of the script, for all oc commands, using the -t option. Please download the latest version.
As documented for the oc command, the option works as follows:
The length of time to wait before giving up on a single server request. Non-zero values should contain a
corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
The default value remains 1 minute, but you should now be able to adjust it.
I hope this solves your issue. Please let me know how it goes; otherwise, I will close this issue next week.
Thanks. Vincent
Thanks for looking into this, and sorry for the late reply. I am still seeing the issue:
ocadmtop_node.sh -t 2m -H
=====
ocadmtop_node.sh -t 0 -H
=====
ocadmtop_node.sh -H
=====
Hi @kotarusv
Sorry to hear this! Can you please confirm that you are running the CLI matching your cluster version?
Can you please share the output of the command:
time oc adm top pod -A --request-timeout='0' --loglevel 8
Cheers, Vincent
time oc adm top pod -A --request-timeout='0' --loglevel 8
I0309 17:24:51.347172 352019 loader.go:373] Config loaded from file: /home/ubuntu/.kube/config
I0309 17:24:51.350775 352019 round_trippers.go:463] GET https://api.example.com:6443/api
I0309 17:24:51.350894 352019 round_trippers.go:469] Request Headers:
I0309 17:24:51.350966 352019 round_trippers.go:473] Authorization: Bearer
real	1m0.411s
user	0m0.122s
sys	0m0.037s
OK, so it seems that the timeout was ignored. Can you run:
time oc adm top pod -A --request-timeout='15m' --loglevel 8 2>&1 | grep -E 'GET|Response Status' | sed -e 's#//[-a-zA-Z.0-9]*:6443#//api.example.com:6443#'
to check whether it is also ignoring a 15-minute timeout, or if the request is simply taking more than 15 minutes.
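For reference, the sed expression in that pipeline anonymizes the cluster API hostname before sharing the logs. A self-contained example (the hostname is made up, and the bracket range is spelled out as a-zA-Z to stay portable across locales):

```shell
# Replace the real cluster API hostname with a placeholder in a log line.
line='GET https://api.prod-cluster.corp:6443/apis 200 OK'
echo "$line" | sed -e 's#//[-a-zA-Z.0-9]*:6443#//api.example.com:6443#'
# prints: GET https://api.example.com:6443/apis 200 OK
```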
Hi @kotarusv
Have you reviewed my last comment? Can you please provide an update on the request and let me know if you still need assistance with this issue?
Cheers,
I apologize for the delay. I was away last week. Here is the reply you requested. I receive complete output from small clusters, such as pod CPU and memory usage. However, in a large cluster, I do not receive any output other than the response below.
time oc adm top pod -A --request-timeout='15m' --loglevel 8 2>&1 | grep -E 'GET|Response Status'
I0408 03:23:17.071138 600193 round_trippers.go:463] GET https://api.com:6443/api?timeout=15m0s
I0408 03:23:17.245076 600193 round_trippers.go:574] Response Status: 200 OK in 173 milliseconds
I0408 03:23:17.245340 600193 round_trippers.go:463] GET https://api.com:6443/apis?timeout=15m0s
I0408 03:23:17.297129 600193 round_trippers.go:574] Response Status: 200 OK in 51 milliseconds
I0408 03:23:17.298310 600193 round_trippers.go:463] GET https://api.com:6443/apis/metrics.k8s.io/v1beta1/pods
I0408 03:23:57.879896 600193 round_trippers.go:574] Response Status: 200 OK in 40581 milliseconds
real	0m41.425s
user	0m0.490s
sys	0m0.337s
Thanks a lot for that output. Actually, I just realized that the metrics query doesn't have the timeout option set, meaning that this query will use the default 60-second timeout. I will raise an issue for this bug and provide you with an update when it is fixed.
Cheers,
Hey @kotarusv (@skotaru-cisco)
I'm working with the engineering team to determine where the issue is. I will keep you updated ASAP.
In the meantime, I'm working on a workaround. So can you please try this command on your largest cluster?
$ time oc get --request-timeout=15m --raw "/apis/metrics.k8s.io/v1beta1/pods" --loglevel 8 2>&1 | grep -E 'GET|Response Status'
$ time oc get --request-timeout=15m --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq -r '.items | length'
The first command should bypass the first 2 API calls and only run the metrics query, with the right timeout (15 min). The second command will simply report the number of pods with metrics running in your cluster.
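To illustrate what the raw payload looks like, here is a tiny sample in the metrics.k8s.io/v1beta1 PodMetricsList shape, queried locally with jq (the namespace, pod name, and usage values are made up):

```shell
# Minimal PodMetricsList-shaped sample, as returned by the --raw endpoint.
payload='{"items":[{"metadata":{"namespace":"demo","name":"web-1"},"containers":[{"name":"web","usage":{"cpu":"12m","memory":"64Mi"}}]}]}'
# Count the items, as in the second command above:
echo "$payload" | jq -r '.items | length'
# Or extract per-pod CPU usage:
echo "$payload" | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.containers[0].usage.cpu)"'
```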
Cheers,
@vlours Thanks for looking. Here is the requested info. Feel free to ask me if any additional info is required.
Thanks @vlours
time oc get --request-timeout=15m --raw "/apis/metrics.k8s.io/v1beta1/pods" --loglevel 8 2>&1 | grep -E 'GET|Response Status'
I0416 15:20:35.029564 706443 round_trippers.go:463] GET https://api.example.com:6443/apis/metrics.k8s.io/v1beta1/pods?timeout=15m0s
I0416 15:21:28.071438 706443 round_trippers.go:574] Response Status: 200 OK in 53041 milliseconds
real	0m53.311s
user	0m0.173s
sys	0m0.029s
time oc get --raw '/apis/metrics.k8s.io/v1beta1/pods' --request-timeout=15m | jq -r '.items | length'
10348
real	0m37.744s
user	0m0.304s
sys	0m0.093s
@kotarusv
Thanks for the outputs.
It looks like you can retrieve the raw data in less than 60 seconds. Since you are very close to the default timeout, this also explains why oc adm top pod may only have worked sometimes.
I've changed the script to work with the raw data instead of the oc adm top pod command, allowing the timeout to be set correctly.
So you should be good to go with the version 0.5 of the script, merged in the main branch.
As mentioned, based on the previous output, setting the timeout option to 2 minutes should be enough in your case. But feel free to adjust it.
Let me know how you are going with this new version.
Waiting to hear from you.
cheers,
Thank you so much for relentlessly working on this issue. It looks like it's working this time. I didn't provide any timeout option. I am able to see the output both by default (all nodes listing) as well as the single node listing using the -H option.
Is the -H option taking more time because it is looping through all the nodes to find the given node?
time ocadmtop_node.sh -H np-rtp-01-wt475-master-3
===== np-rtp-01-wt475-master-1 =====
real	0m51.444s
user	0m2.108s
sys	0m0.432s
The -H option should be quicker, as it only retrieves the oc describe node output for the desired nodes.
Overall, though, the script should take roughly the same time to run with any option, as in your case the latency comes from the metric collection, which takes around 50 seconds. But if you run the script twice in a row, the second run may be quicker, as some data will already be cached on your local server.
I hope this answers your question.
Thanks. I'm good now. It is working as expected.
I can see the version command giving the below errors. Please check:
ocadmtop_node.sh -h
ocadmtop_node.sh [-c|-m|-p] [-A|-L
I would suggest one improvement, if possible: add a -n flag to pass the namespace and sort by memory and CPU. We can still use oc adm top pod -n <namespace>, but then this script could cover all the functions.
Srinivas Kotaru
Release v0.6 includes the -n option to filter on a namespace, as part of RFE #6, plus a performance improvement: the script now collects pod data from a single oc get pod -A -o wide call instead of multiple oc describe node calls.
The namespace filtering should also improve performance, as it applies to both the metric and pod lists.
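The idea behind the performance change can be sketched like this (the table layout and the awk filtering logic are assumptions for illustration, not the script's exact code):

```shell
# One cluster-wide pod listing (as from `oc get pod -A -o wide`), filtered
# locally per node, instead of one API round-trip per node.
pods='NAMESPACE NAME NODE
demo web-1 worker-0
demo web-2 worker-1'
echo "$pods" | awk -v target="worker-0" 'NR > 1 && $3 == target {print $1 "/" $2}'
# prints: demo/web-1
```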
Please download the latest version and let me know if this is working as expected. If you have any other needs, feel free to raise another Issue (same as I did from your namespace request in #6 )
Cheers,
ocadmtop_node.sh -H cae-ai-rcdn-dev-22vtl-worker-0-hxv9k
===== cae-ai-rcdn-dev-22vtl-worker-0-hxv9k =====
awk: fatal: cannot use gawk builtin `namespace' as variable name
(the same error is repeated for each awk invocation)
I am using an Ubuntu-based workstation to run my commands.
Srinivas Kotaru
Thanks @kotarusv for the feedback. 🙇 And sorry for the inconvenience. This error did not occur in my test environment.
I've patched the script to use alternate variable names in PR #8 . They should not conflict anymore with the built-in function on Ubuntu. Please download the latest version: 0.6.1
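For context, gawk 5 reserves the word namespace (for its @namespace feature), so passing it with -v fails exactly as reported. A minimal reproduction sketch (the variable names here are illustrative, not the script's actual ones):

```shell
# On gawk >= 5.0 this fails with:
#   awk: fatal: cannot use gawk builtin `namespace' as variable name
#   echo "default pod-1" | awk -v namespace="default" '$1 == namespace {print $2}'
# A renamed variable works everywhere:
echo "default pod-1" | awk -v ns="default" '$1 == ns {print $2}'
# prints: pod-1
```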
Perfect. It is working as expected. I tested all nodes, a single node, and a namespace. I like the namespace output, as it shows which nodes the namespace's pods are running on. This is probably useful for client namespaces, but too verbose for any infra namespace that has daemonset pods running. In those situations, I still prefer the oc adm top pod -n openshift-monitoring --sort-by=cpu/memory output.
Thank you very much for patiently fixing all the issues reported. I really appreciate it.
Thanks for your feedback! 🙇 I'm closing this issue for now. Feel free to raise a new issue/RFE if you are facing a new problem or are looking for new functionality.
I've already planned to add a new feature in the script to display the stats at the containers level (RFE #4 ).
And If you like this script and want to support it, please click on the star button in the repository page https://github.com/vlours/ocadmtop
Hello
I found the script helpful for troubleshooting. However, the script times out when running on a busy or large cluster. Can we add an option to set the timeout or increase the default timeout?
I tested on an OpenShift 4.12 cluster. Our clusters are large, with hundreds of nodes and many pods on each node.
Below is the error, even for a single node:
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
Srinivas Kotaru