kubernetes-sigs / dashboard-metrics-scraper

Container to scrape, store, and retrieve a window of time from the Metrics Server.
Apache License 2.0

pod-list metrics time out with more than a handful of pods #14

Closed: brandond closed this issue 5 years ago

brandond commented 5 years ago

I'm trying to isolate an issue where the pod list in the dashboard never loads; the spinner just cycles forever. The browser's network trace shows that the request to retrieve pod usage statistics is hanging.
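For reference, the hanging request can be reproduced outside the dashboard UI. The snippet below is a minimal sketch, not part of the original report: it assumes `kubectl proxy` is listening on localhost:8001, and the namespace and pod names are placeholders taken from the audit log further down.

# Sketch only: hit the scraper's pod-list endpoint through the apiserver proxy.
# Assumes `kubectl proxy` on localhost:8001; namespace/pod names are placeholders.
import requests

NAMESPACE = "twistlock"  # placeholder namespace
PODS = ",".join([
    "twistlock-console-central-b99d5f656-dhh4b",        # placeholder pod names
    "twistlock-defender-supervisor6-5976f5c586-7xp9p",
])

url = (
    "http://localhost:8001/api/v1/namespaces/kube-system/services/"
    "dashboard-metrics-scraper/proxy/api/v1/dashboard/"
    f"namespaces/{NAMESPACE}/pod-list/{PODS}/metrics/memory/usage"
)

try:
    resp = requests.get(url, timeout=30)  # a healthy scraper answers quickly
    print(resp.status_code, len(resp.content))
except requests.exceptions.Timeout:
    print("request timed out")  # the reported hang shows up as a timeout here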

The k8s apiserver audit log says the request is failing with a 503:

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Request",
    "auditID": "cde9850b-36d7-4cf6-ab0b-e2625606a19a",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/kube-system/services/dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/twistlock/pod-list/twistlock-defender-supervisor6-5976f5c586-7xp9p,twistlock-console-supervisor6-5cd756f5c9-rsn9k,twistlock-defender-supervisor2-7c9d84c9d-wnd4s,twistlock-defender-supervisor3-845f9db9c4-rmgdx,twistlock-defender-supervisor4-8576b48cf8-p7mtn,twistlock-defender-supervisor5-5cf95f85d6-nll9w,twistlock-defender-supervisor1-6bd65ff946-lp2g2,twistlock-console-supervisor4-7b4479ddb5-kbcdm,twistlock-console-supervisor5-565467d7d4-hxdsx,twistlock-console-supervisor1-587c86f876-z77nx,twistlock-console-supervisor2-844d988d79-xsx4p,twistlock-console-supervisor3-5669d4797d-ltcfg,twistlock-console-central-b99d5f656-dhh4b/metrics/memory/usage",
    "verb": "get",
    "user": {
        "username": "system:serviceaccount:kube-system:kubernetes-dashboard",
        "uid": "a9f9c762-04af-11e9-8320-02f9e7f869f8",
        "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:kube-system",
            "system:authenticated"
        ]
    },
    "sourceIPs": [
        "10.12.135.186"
    ],
    "userAgent": "dashboard/v2.0.0-beta1",
    "objectRef": {
        "resource": "services",
        "namespace": "kube-system",
        "name": "dashboard-metrics-scraper",
        "apiVersion": "v1",
        "subresource": "proxy"
    },
    "responseStatus": {
        "metadata": {},
        "code": 503
    },
    "requestReceivedTimestamp": "2019-07-11T22:45:42.661024Z",
    "stageTimestamp": "2019-07-11T23:01:20.636933Z",
    "annotations": {
        "authorization.k8s.io/decision": "allow",
        "authorization.k8s.io/reason": "RBAC: allowed by RoleBinding \"kubernetes-dashboard-minimal/kube-system\" of Role \"kubernetes-dashboard-minimal\" to ServiceAccount \"kubernetes-dashboard/kube-system\""
    }
}

I've enabled debug-level logging in the scraper, but the request never appears in its logs at all.

A few additional notes:

jeefy commented 5 years ago

/assign

brandond commented 5 years ago

I wrote a simple Python script to brute-force the minimum request length at which requests start to time out. At least in my dev environment the behaviour is consistent and correlated with URI length: requests hang once the URI reaches 610 characters or more. A sketch of the approach follows the results below.

Python script: https://gist.github.com/brandond/28153c1b5f823b6191e4cea68c680423

Results:

pods in kube-system: aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-dashboard-787f6fb4d8-qmsbp,kubernetes-metrics-scraper-86667748bb-zszjg,metrics-server-578dc65b48-b92fg,node-exporter-4fdsw,node-exporter-dmhk2,node-exporter-kt292,node-exporter-rws67,node-exporter-xlqfp,node-exporter-z8b8m,prometheus-0
Request timed out with len(pod-list)=447 len(uri)=610
        http://localhost:8001/api/v1/namespaces/kube-system/services/dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-d/metrics/memory/usage
pods in twistlock: twistlock-console-central-b99d5f656-dhh4b,twistlock-console-supervisor1-587c86f876-z77nx,twistlock-console-supervisor2-844d988d79-xsx4p,twistlock-console-supervisor3-5669d4797d-ltcfg,twistlock-console-supervisor4-7b4479ddb5-kbcdm,twistlock-console-supervisor5-565467d7d4-hxdsx,twistlock-console-supervisor6-5cd756f5c9-rsn9k,twistlock-defender-ds-4xwd6,twistlock-defender-ds-g7hb9,twistlock-defender-ds-h6gtz,twistlock-defender-ds-mm8cv,twistlock-defender-ds-x26dg,twistlock-defender-ds-xmnxs,twistlock-defender-supervisor1-6bd65ff946-nnk99,twistlock-defender-supervisor2-7c9d84c9d-9544w,twistlock-defender-supervisor3-845f9db9c4-wdswh,twistlock-defender-supervisor4-8576b48cf8-4bc4k,twistlock-defender-supervisor5-5cf95f85d6-z7jqs,twistlock-defender-supervisor6-5976f5c586-88lgt
Request timed out with len(pod-list)=449 len(uri)=610
        http://localhost:8001/api/v1/namespaces/kube-system/services/dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/twistlock/pod-list/twistlock-console-central-b99d5f656-dhh4b,twistlock-console-supervisor1-587c86f876-z77nx,twistlock-console-supervisor2-844d988d79-xsx4p,twistlock-console-supervisor3-5669d4797d-ltcfg,twistlock-console-supervisor4-7b4479ddb5-kbcdm,twistlock-console-supervisor5-565467d7d4-hxdsx,twistlock-console-supervisor6-5cd756f5c9-rsn9k,twistlock-defender-ds-4xwd6,twistlock-defender-ds-g7hb9,twistlock-defender-ds-h6gtz,twistlock-defender-ds-mm8cv,twistlock-def/metrics/memory/usage
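The gist itself is not reproduced in this thread. As an illustration only, a minimal sketch of the brute-force approach (assuming `kubectl proxy` on localhost:8001, with the pod names and the 10-second timeout as placeholders) might look like this:

# Sketch of the brute-force approach: grow the pod-list one character at a
# time until the request stops returning and times out instead.
# Assumes `kubectl proxy` on localhost:8001; pod names are placeholders.
import requests

BASE = ("http://localhost:8001/api/v1/namespaces/kube-system/services/"
        "dashboard-metrics-scraper/proxy/api/v1/dashboard/"
        "namespaces/kube-system/pod-list/")
SUFFIX = "/metrics/memory/usage"
PODS = "aws-node-6jgtg,aws-node-nkknm,coredns-955588fc4-46krq"  # placeholder pod list

for length in range(1, len(PODS) + 1):
    pod_list = PODS[:length]
    uri = BASE + pod_list + SUFFIX
    try:
        requests.get(uri, timeout=10)
    except requests.exceptions.Timeout:
        # Report the shortest pod-list/URI that hangs, in the same format as above.
        print(f"Request timed out with len(pod-list)={len(pod_list)} len(uri)={len(uri)}")
        print(f"\t{uri}")
        break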
brandond commented 5 years ago

I tested from within the cluster and the same thing occurs. Additionally, the timeout happens even when the request should fall through to one of the default handlers (tested by changing /api/ to /apx/). A sketch of how these in-cluster checks can be driven follows the examples below.

OK:

http://dashboard-metrics-scraper:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-/metrics/memory/usage

OK:

http://dashboard-metrics-scraper.kube-system:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-/metrics/memory/usage

Timeout:

http://dashboard-metrics-scraper:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-d/metrics/memory/usage

Timeout:

http://dashboard-metrics-scraper:8000/apx/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-d/metrics/memory/usage
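For completeness, a minimal sketch of how the OK/Timeout classification above can be reproduced. Assumptions: it runs from a pod inside the cluster so the dashboard-metrics-scraper service name resolves, and the URL list is filled in with the in-cluster URLs being tested.

# Sketch: classify in-cluster requests as OK or Timeout, as in the examples above.
# Assumes it runs inside the cluster (e.g. from a debug pod) so the service name resolves.
import requests

URLS = [
    # Replace with the in-cluster URLs under test, e.g. the four examples above.
    "http://dashboard-metrics-scraper:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg/metrics/memory/usage",
]

for url in URLS:
    try:
        requests.get(url, timeout=10)
        print("OK:", url)
    except requests.exceptions.Timeout:
        print("Timeout:", url)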
brandond commented 5 years ago

Disregard: it turns out this was caused by recent changes to the configuration of a security agent on the hosts. The agent reported that it was resetting the connection due to "URI Path Length Too Long" and then dropping all traffic after the reset due to "Packet on Closed Connection". Disabling the agent resolved all issues.