Address scalability issue when Node Watcher is enabled

xing-yang commented 3 years ago

We have an issue https://github.com/kubernetes-csi/external-health-monitor/issues/75 to change the code to only watch Pods and Nodes when the Node Watcher component is enabled. We still need to address the scalability issue when Node Watcher is enabled:

kubernetes/kubernetes#102452 (comment)

xing-yang commented 3 years ago

@NickrenREN I wonder if you've seen a similar issue in production.

NickrenREN commented 3 years ago

Node Watcher is a single-instance controller. What is the scalability issue?

xing-yang commented 3 years ago

@NickrenREN It affects the e2e tests. Details are in this issue: https://github.com/kubernetes/kubernetes/issues/102452

After we disabled the external-health-monitor, the failure went away.

NickrenREN commented 3 years ago

IIUC, the root cause of the scalability issue you mention is that Node Watcher watches PVCs, Nodes, and Pods? I don't understand why that would be a problem. The default k8s scheduler does the same thing.

NickrenREN commented 3 years ago

A watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

NickrenREN commented 3 years ago

I saw a lot of API throttling, so maybe we can decrease the API call frequency?

xing-yang commented 3 years ago

> A watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

This needs more investigation. The observation is that the failure went away when external-health-monitor was disabled, came back when it was re-enabled, and went away again when it was disabled a second time.

> I saw a lot of API throttling, so maybe we can decrease the API call frequency?

We could try that.

NickrenREN commented 3 years ago

> This needs more investigation. The observation is that the failure went away when external-health-monitor was disabled, came back when it was re-enabled, and went away again when it was disabled a second time.

This indicates that the controller causes the failure (API throttling?), but I still don't think the watch is the root cause.

xing-yang commented 3 years ago

> This indicates that the controller causes the failure (API throttling?), but I still don't think the watch is the root cause.

The external-health-monitor controller added more load to the API server, which might have triggered those failures.

NickrenREN commented 3 years ago

> The external-health-monitor controller added more load to the API server, which might have triggered those failures.

I agree, so we can try to decrease the API call frequency first.

sonasingh46 commented 3 years ago

I would like to work on this issue. I will start looking into it to understand the problem.

sonasingh46 commented 3 years ago

/assign

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

xing-yang commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

/lifecycle stale

k8s-triage-robot commented 2 years ago

/lifecycle rotten

k8s-triage-robot commented 2 years ago

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

pohly commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@pohly: Reopened this issue.

pohly commented 2 years ago

/lifecycle frozen

mowangdk commented 3 weeks ago

/assign
