Metrics scraper pod overloads and crashes

Joseph-Goergen commented 3 years ago

We have a cluster with 34 nodes and 2497 pods. The metrics scraper appeared to reach 5000m of CPU and 6.7G of memory before eventually crashing.
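For scale, the figures above can be converted to base units. This is a minimal sketch, assuming the reported values follow Kubernetes quantity notation (`m` for milli, `G` for decimal giga); `parse_quantity` is a hypothetical helper written for illustration, not part of any Kubernetes client library.

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes-style quantity suffixes.
# Decimal suffixes (m, k, M, G) are powers of ten; binary ones (Ki, Mi, Gi) are powers of two.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '5000m' or '6.7G' to base units (hypothetical helper)."""
    # Try two-character suffixes first so 'Gi' is not mistaken for a one-character suffix.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


cpu_cores = parse_quantity("5000m")        # CPU usage reported in the issue, in cores
memory_bytes = parse_quantity("6.7G")      # memory usage reported in the issue, in bytes
pods_per_node = Decimal(2497) / Decimal(34)  # average pod density on this cluster

print(f"CPU: {cpu_cores} cores, memory: {memory_bytes} bytes")
print(f"Average pods per node: {pods_per_node:.1f}")
```

Under that assumption the scraper was using about 5 full cores and 6.7 billion bytes of memory on a cluster averaging roughly 73 pods per node.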

Dashboard version v2.0.5
Metrics scraper version v1.0.6

The metrics scraper produces roughly 500,000 log lines per hour, which look like this:

Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 874 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 875 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 878 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 888 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 892 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 891 "" "dashboard/v2.0.5"
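To put the log volume in perspective, here is a rough sketch that parses one of the sample lines and turns the hourly figure into a request rate. It assumes each log line corresponds to exactly one request and that requests are spread evenly across the 34 nodes; neither is confirmed in this issue. `REQUEST_RE` and `parse_request` are hypothetical names used only for illustration.

```python
import re

# Figures reported in this issue.
LOG_LINES_PER_HOUR = 500_000
NODE_COUNT = 34

# Matches the access-log portion of a line: "METHOD PATH PROTO" STATUS SIZE
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r"(?P<status>\d{3}) (?P<size>\d+)"
)


def parse_request(line: str) -> dict:
    """Extract method, path, protocol, status and response size from one log line."""
    match = REQUEST_RE.search(line)
    if match is None:
        raise ValueError("no request found in log line")
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields


# One of the sample lines from above, copied verbatim.
sample = (
    "Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k "
    "dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "
    '"GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" '
    '200 874 "" "dashboard/v2.0.5"'
)

request = parse_request(sample)

# Assumption: one log line per request, evenly distributed over the nodes.
requests_per_second = LOG_LINES_PER_HOUR / 3600
requests_per_second_per_node = requests_per_second / NODE_COUNT

print(request["method"], request["path"], request["status"], request["size"])
print(f"{requests_per_second:.1f} requests/s overall")
print(f"{requests_per_second_per_node:.1f} requests/s per node")
```

Under these assumptions the scraper is serving on the order of 139 requests per second, or about 4 per second per node.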

It seems to be handling the requests as it should; it is simply getting overloaded and does not cope with that very well. I don't think adding CPU and memory limits would help much, because I expect the pod would keep crashing each time it hit them. This is about all the information I have for this cluster. The user did delete the pod, but it came back, became overloaded, and crashed again.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/dashboard-metrics-scraper/issues/38#issuecomment-869185325):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Joseph-Goergen commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@Joseph-Goergen: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/dashboard-metrics-scraper/issues/38#issuecomment-869683634):

>/reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Joseph-Goergen commented 3 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Joseph-Goergen commented 3 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

/lifecycle stale

Joseph-Goergen commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

maciaszczykm commented 2 years ago

/lifecycle frozen
