kubernetes-sigs / dashboard-metrics-scraper

Container to scrape, store, and retrieve a window of time from the Metrics Server.
Apache License 2.0

pod-list metrics time out with more than a handful of pods #14

Closed: brandond closed this issue 5 years ago

brandond commented 5 years ago

I'm trying to isolate an issue where the pod list in the dashboard never loads; the spinner just cycles forever. The browser's network trace shows that the request to retrieve pod usage statistics is hanging.
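For reference, the hanging request can be reproduced outside the dashboard UI. The snippet below is a minimal sketch, not part of the original report: it assumes `kubectl proxy` is listening on localhost:8001, and the namespace and pod names are placeholders taken from the audit log further down.

# Sketch only: hit the scraper's pod-list endpoint through the apiserver proxy.
# Assumes `kubectl proxy` on localhost:8001; namespace/pod names are placeholders.
import requests

NAMESPACE = "twistlock"  # placeholder namespace
PODS = ",".join([
    "twistlock-console-central-b99d5f656-dhh4b",        # placeholder pod names
    "twistlock-defender-supervisor6-5976f5c586-7xp9p",
])

url = (
    "http://localhost:8001/api/v1/namespaces/kube-system/services/"
    "dashboard-metrics-scraper/proxy/api/v1/dashboard/"
    f"namespaces/{NAMESPACE}/pod-list/{PODS}/metrics/memory/usage"
)

try:
    resp = requests.get(url, timeout=30)  # a healthy scraper answers quickly
    print(resp.status_code, len(resp.content))
except requests.exceptions.Timeout:
    print("request timed out")  # the reported hang shows up as a timeout here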

The k8s apiserver audit log says the request is failing with a 503:

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Request",
    "auditID": "cde9850b-36d7-4cf6-ab0b-e2625606a19a",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/kube-system/services/dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/twistlock/pod-list/twistlock-defender-supervisor6-5976f5c586-7xp9p,twistlock-console-supervisor6-5cd756f5c9-rsn9k,twistlock-defender-supervisor2-7c9d84c9d-wnd4s,twistlock-defender-supervisor3-845f9db9c4-rmgdx,twistlock-defender-supervisor4-8576b48cf8-p7mtn,twistlock-defender-supervisor5-5cf95f85d6-nll9w,twistlock-defender-supervisor1-6bd65ff946-lp2g2,twistlock-console-supervisor4-7b4479ddb5-kbcdm,twistlock-console-supervisor5-565467d7d4-hxdsx,twistlock-console-supervisor1-587c86f876-z77nx,twistlock-console-supervisor2-844d988d79-xsx4p,twistlock-console-supervisor3-5669d4797d-ltcfg,twistlock-console-central-b99d5f656-dhh4b/metrics/memory/usage",
    "verb": "get",
    "user": {
        "username": "system:serviceaccount:kube-system:kubernetes-dashboard",
        "uid": "a9f9c762-04af-11e9-8320-02f9e7f869f8",
        "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:kube-system",
            "system:authenticated"
        ]
    },
    "sourceIPs": [
        "10.12.135.186"
    ],
    "userAgent": "dashboard/v2.0.0-beta1",
    "objectRef": {
        "resource": "services",
        "namespace": "kube-system",
        "name": "dashboard-metrics-scraper",
        "apiVersion": "v1",
        "subresource": "proxy"
    },
    "responseStatus": {
        "metadata": {},
        "code": 503
    },
    "requestReceivedTimestamp": "2019-07-11T22:45:42.661024Z",
    "stageTimestamp": "2019-07-11T23:01:20.636933Z",
    "annotations": {
        "authorization.k8s.io/decision": "allow",
        "authorization.k8s.io/reason": "RBAC: allowed by RoleBinding \"kubernetes-dashboard-minimal/kube-system\" of Role \"kubernetes-dashboard-minimal\" to ServiceAccount \"kubernetes-dashboard/kube-system\""
    }
}

I've enabled debug-level logging in the scraper, but the request never appears in its logs at all.

A few additional notes:

jeefy commented 5 years ago

/assign

brandond commented 5 years ago

I wrote a simple Python script to brute-force the minimum request length at which requests start to time out. At least in my dev environment the behaviour is consistent and correlated with URI length: requests hang once the URI reaches 610 characters or more. A sketch of the approach follows the results below.

Python script: https://gist.github.com/brandond/28153c1b5f823b6191e4cea68c680423

Results:

pods in kube-system: aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-dashboard-787f6fb4d8-qmsbp,kubernetes-metrics-scraper-86667748bb-zszjg,metrics-server-578dc65b48-b92fg,node-exporter-4fdsw,node-exporter-dmhk2,node-exporter-kt292,node-exporter-rws67,node-exporter-xlqfp,node-exporter-z8b8m,prometheus-0
Request timed out with len(pod-list)=447 len(uri)=610
        http://localhost:8001/api/v1/namespaces/kube-system/services/dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-d/metrics/memory/usage
pods in twistlock: twistlock-console-central-b99d5f656-dhh4b,twistlock-console-supervisor1-587c86f876-z77nx,twistlock-console-supervisor2-844d988d79-xsx4p,twistlock-console-supervisor3-5669d4797d-ltcfg,twistlock-console-supervisor4-7b4479ddb5-kbcdm,twistlock-console-supervisor5-565467d7d4-hxdsx,twistlock-console-supervisor6-5cd756f5c9-rsn9k,twistlock-defender-ds-4xwd6,twistlock-defender-ds-g7hb9,twistlock-defender-ds-h6gtz,twistlock-defender-ds-mm8cv,twistlock-defender-ds-x26dg,twistlock-defender-ds-xmnxs,twistlock-defender-supervisor1-6bd65ff946-nnk99,twistlock-defender-supervisor2-7c9d84c9d-9544w,twistlock-defender-supervisor3-845f9db9c4-wdswh,twistlock-defender-supervisor4-8576b48cf8-4bc4k,twistlock-defender-supervisor5-5cf95f85d6-z7jqs,twistlock-defender-supervisor6-5976f5c586-88lgt
Request timed out with len(pod-list)=449 len(uri)=610
        http://localhost:8001/api/v1/namespaces/kube-system/services/dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/twistlock/pod-list/twistlock-console-central-b99d5f656-dhh4b,twistlock-console-supervisor1-587c86f876-z77nx,twistlock-console-supervisor2-844d988d79-xsx4p,twistlock-console-supervisor3-5669d4797d-ltcfg,twistlock-console-supervisor4-7b4479ddb5-kbcdm,twistlock-console-supervisor5-565467d7d4-hxdsx,twistlock-console-supervisor6-5cd756f5c9-rsn9k,twistlock-defender-ds-4xwd6,twistlock-defender-ds-g7hb9,twistlock-defender-ds-h6gtz,twistlock-defender-ds-mm8cv,twistlock-def/metrics/memory/usage
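The gist itself is not reproduced in this thread. As an illustration only, a minimal sketch of the brute-force approach (assuming `kubectl proxy` on localhost:8001, with the pod names and the 10-second timeout as placeholders) might look like this:

# Sketch of the brute-force approach: grow the pod-list one character at a
# time until the request stops returning and times out instead.
# Assumes `kubectl proxy` on localhost:8001; pod names are placeholders.
import requests

BASE = ("http://localhost:8001/api/v1/namespaces/kube-system/services/"
        "dashboard-metrics-scraper/proxy/api/v1/dashboard/"
        "namespaces/kube-system/pod-list/")
SUFFIX = "/metrics/memory/usage"
PODS = "aws-node-6jgtg,aws-node-nkknm,coredns-955588fc4-46krq"  # placeholder pod list

for length in range(1, len(PODS) + 1):
    pod_list = PODS[:length]
    uri = BASE + pod_list + SUFFIX
    try:
        requests.get(uri, timeout=10)
    except requests.exceptions.Timeout:
        # Report the shortest pod-list/URI that hangs, in the same format as above.
        print(f"Request timed out with len(pod-list)={len(pod_list)} len(uri)={len(uri)}")
        print(f"\t{uri}")
        break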
brandond commented 5 years ago

I tested from within the cluster and the same thing occurs. Additionally, the timeout happens even when the request should fall through to one of the default handlers (tested by changing /api/ to /apx/). A sketch of how these in-cluster checks can be driven follows the examples below.

OK:

http://dashboard-metrics-scraper:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-/metrics/memory/usage

OK:

http://dashboard-metrics-scraper.kube-system:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-/metrics/memory/usage

Timeout:

http://dashboard-metrics-scraper:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-d/metrics/memory/usage

Timeout:

http://dashboard-metrics-scraper:8000/apx/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg,aws-node-nkknm,aws-node-q9sch,aws-node-xpxxj,aws-node-znsrt,aws-node-zvddg,coredns-955588fc4-46krq,coredns-955588fc4-lf7pb,external-dns-79dd4f7cf5-v9hw2,grafana-5494847df5-vd2d9,kiam-agent-26kqr,kiam-agent-6xdp8,kiam-agent-bwfsk,kiam-agent-lj9hc,kiam-server-77djh,kiam-server-ps7qj,kube-proxy-827bf,kube-proxy-gm8tl,kube-proxy-hd7dl,kube-proxy-hst76,kube-proxy-qcp26,kube-proxy-z7bjh,kube-state-metrics-5975c6f6dc-qx7w4,kubernetes-d/metrics/memory/usage
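For completeness, a minimal sketch of how the OK/Timeout classification above can be reproduced. Assumptions: it runs from a pod inside the cluster so the dashboard-metrics-scraper service name resolves, and the URL list is filled in with the in-cluster URLs being tested.

# Sketch: classify in-cluster requests as OK or Timeout, as in the examples above.
# Assumes it runs inside the cluster (e.g. from a debug pod) so the service name resolves.
import requests

URLS = [
    # Replace with the in-cluster URLs under test, e.g. the four examples above.
    "http://dashboard-metrics-scraper:8000/api/v1/dashboard/namespaces/kube-system/pod-list/aws-node-6jgtg/metrics/memory/usage",
]

for url in URLS:
    try:
        requests.get(url, timeout=10)
        print("OK:", url)
    except requests.exceptions.Timeout:
        print("Timeout:", url)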
brandond commented 5 years ago

Disregard: it turns out this was caused by recent changes to the configuration of a security agent on the hosts. The agent reported that it was resetting the connection due to "URI Path Length Too Long" and then dropping all traffic after the reset due to "Packet on Closed Connection". Disabling the agent resolved all issues.