Sub-second / More granular probes

mikebrow commented 2 years ago

Enhancement Description

Allow Probe fields to be specified in seconds or milliseconds.
Kubernetes Enhancement Proposal: draft https://github.com/kubernetes/enhancements/pull/3067
Discussion Link: sig-node discussion notes
Primary contact (assignee): @mikebrow
Responsible SIGs: Sig-Node
Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.29
- Beta release target (x.y):
- Stable release target (x.y):
[ ] Alpha
- [ ] KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/3067
- [ ] Code (k/k) update PR(s): https://github.com/kubernetes/kubernetes/pull/107958
- [ ] Docs (k/website) update PR(s):

k8s-ci-robot commented 2 years ago

@mikebrow: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/enhancements/issues/3066#issuecomment-982974554):

>/sig SIG-NODE

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-ci-robot commented 2 years ago

@mikebrow: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/enhancements/issues/3066#issuecomment-982975488):

>/sig sig-node

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

mikebrow commented 2 years ago

/sig node

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

psschwei commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

psschwei commented 2 years ago

/remove-lifecycle stale

marosset commented 1 year ago

/milestone v1.26
/label lead-opted-in
(I'm doing this on behalf of @ruiwen-zhao / SIG-node)

rhockenbury commented 1 year ago

/stage alpha
/label tracked/yes

Atharva-Shinde commented 1 year ago

Hey @mikebrow 👋, 1.26 Enhancements team here!

Just checking in as we approach Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022.

This enhancement is targeting stage alpha for 1.26

Here's where this enhancement currently stands:

[ ] KEP file using the latest template has been merged into the k/enhancements repo.
[X] KEP status is marked as implementable
[ ] KEP has an updated detailed test plan section filled out
[X] KEP has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would need to:

The KEP needs its Test Plan section updated to incorporate the details stated in the updated detailed test plan
- We need to include the acknowledgement that is missing from this enhancement's Test Plan
Get PR #3067 merged with the required changes before Enhancements Freeze to make this enhancement eligible for the 1.26 release.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

Atharva-Shinde commented 1 year ago

Hello @mikebrow 👋, just a quick check-in again, as we approach the 1.26 Enhancements freeze.

Please plan to get the action items mentioned in my comment above done before Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022, i.e. tomorrow.

Note that the current status of the enhancement is marked at risk :)

rhockenbury commented 1 year ago

Hello 👋, 1.26 Enhancements Lead here.

Unfortunately, this enhancement did not meet requirements for enhancements freeze.

If you still wish to progress this enhancement in v1.26, please file an exception request. Thanks!

/milestone clear
/label tracked/no
/remove-label tracked/yes
/remove-label lead-opted-in

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

psschwei commented 1 year ago

/remove-lifecycle stale

SergeyKanzhelev commented 1 year ago

This KEP needs to explain how the limits on node resources around sockets would be addressed. See https://github.com/kubernetes/kubernetes/pull/115143 for details.
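As a rough illustration of the socket concern, the number of sockets sitting in TIME_WAIT on a node can be estimated as connection rate times residence time. The sketch below is only a back-of-the-envelope model, not something taken from the KEP or the linked PR: it assumes every probe opens a fresh TCP connection, and the function name and all parameter values are hypothetical.

```python
def time_wait_sockets(pods: int, probes_per_pod: int,
                      period_s: float, time_wait_s: float) -> float:
    """Estimate the steady-state number of sockets in TIME_WAIT.

    Assumption: each probe opens one new TCP connection that then
    lingers in TIME_WAIT for time_wait_s seconds after it is closed.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    connections_per_second = pods * probes_per_pod / period_s
    return connections_per_second * time_wait_s


# Hypothetical node: 100 pods with 2 network probes each.
print(time_wait_sockets(100, 2, period_s=1.0, time_wait_s=60.0))
print(time_wait_sockets(100, 2, period_s=0.25, time_wait_s=60.0))
```

Under these assumptions, cutting the period from 1 s to 0.25 s quadruples the estimate, which is why the socket budget matters for sub-second probes.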

Nibelheims commented 1 year ago

Hello, I am very interested in this KEP. I happen to also wish for subsecond probes, and I was happy to stumble on this. I see there is even an implementation! :) I have looked around, and found that in the corresponding discussion, @aojea also pointed this out.

Would enabling SO_REUSEADDR in addition to SO_LINGER(1), as you did, only on probe-related sockets (hence in your new ProbeDialer) be a good idea to address this? In case of ephemeral ports exhaustion, even with a TIME_WAIT state reduced to 1s with your improvement, it could allow the client side (prober) to reuse an existing socket (but with the risk of misinterpreting an old reply hitting a newer probe on a "recycled" ephemeral port)?

On Linux the net.ipv4.tcp_tw_reuse might be used to achieve the same, but this is Linux only.
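For readers unfamiliar with the two options, here is a minimal sketch of what setting them on a single client socket looks like. It is written in Python for brevity and is not the ProbeDialer from the implementation PR; `new_probe_socket` is a hypothetical helper, and the `struct linger` layout of two C ints is an assumption that holds on Linux and other POSIX systems.

```python
import socket
import struct


def new_probe_socket(linger_seconds: int = 1) -> socket.socket:
    """Create a TCP client socket with SO_REUSEADDR and SO_LINGER set.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow local addresses to be reused by this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # struct linger { int l_onoff; int l_linger; } on POSIX systems.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, linger_seconds))
    return sock


sock = new_probe_socket()
onoff, linger = struct.unpack(
    "ii", sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
print(onoff != 0, linger)
sock.close()
```

Because the options are applied per socket, they can be confined to probe traffic without changing behaviour for other connections on the node.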

SergeyKanzhelev commented 1 year ago

I am very interested in this KEP.

Curious, do you need it for startup, readiness, or liveness probe? Or all of them? What interval are you thinking about?

Nibelheims commented 1 year ago

Hello Sergey, I'd like to have subsecond delays/periods for all kinds of probes, in order to detect a failure as fast as possible. As explained in the KEP's README, the general idea would be to reduce latencies. I do not have a precise value in mind right now, but the current "second scale" is too coarse. Thank you.

mikebrow commented 1 year ago

nod being able to more precisely control the timing is a major part of the KEP and implementation.. If you know it takes 1.2 seconds to start up a DB.. it doesn't make sense to try at 1sec then 2sec or to wait for the 2 sec mark.. Instead maybe it would be better to wait for 1.5 seconds? Totally depends on the model being used and if they can switch to a ready on event push model instead of a state polling model.
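The 1.2 second example can be worked through with a small calculation. This is a simplified model, not kubelet's actual scheduling (which also adds jitter): it assumes probes fire at the initial delay and then at every multiple of the period after it, and `first_success_time` is a hypothetical helper name.

```python
import math


def first_success_time(ready_at_s: float, period_s: float,
                       initial_delay_s: float = 0.0) -> float:
    """Return when the first probe fires at or after the app is ready.

    Assumption: probes fire at initial_delay_s + k * period_s, k = 0, 1, ...
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    if ready_at_s <= initial_delay_s:
        return initial_delay_s
    # Number of whole periods needed to reach or pass the ready time.
    periods = math.ceil((ready_at_s - initial_delay_s) / period_s)
    return initial_delay_s + periods * period_s


# App becomes ready 1.2 s after start.
print(first_success_time(1.2, period_s=1.0))   # whole-second period
print(first_success_time(1.2, period_s=0.5))   # sub-second period
print(first_success_time(1.2, period_s=1.0, initial_delay_s=1.5))
```

In this model a 1 s period detects readiness at the 2 s mark, while either a 0.5 s period or a 1.5 s initial delay detects it at 1.5 s.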

Just needs SIG-NODE approval.. timing of this change vs all the other changes keeps pushing it back.. But I think it's ready any time the sigs are ready for it.

SergeyKanzhelev commented 1 year ago

The reason I'm asking is that for liveness probes, and partially for readiness probes, using streaming instead of pings may work even better. For HTTP it may be some version of a long poll, for gRPC a streaming health service. Streaming may eliminate many scalability concerns. The only catch is that it will not work well for startup and for readiness flipping back to Ready. Retrying to establish a connection will be easier to do with the same coarseness of 1s+.

Nibelheims commented 1 year ago

Hello, I agree that streaming (in the sense of keeping the same socket alive for each probe?) would be preferable; unfortunately, it is not always possible to make the application compliant. One will probably want to use this feature with some existing payload or applications they have not developed. This would also require its own change in the probing mechanism.

A workaround might be possible by using sidecars: implement stream probes (using the same socket forever) targeting a sidecar which would, at its level, perform sub-second probes into the desired container. The sidecar would handle the persistent connection with the k8s stream probe, and would locally perform sub-second checks. This would move the problem from the kubelet into sidecars, which could look like "dissolving" the network overload. However this may seem overly complicated for a questionable result, since each physical node will still have to deal with more resource consumption.
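To make the sidecar idea a bit more concrete, here is a minimal sketch of the local half of it: a loop that runs a check at a sub-second period and caches the latest result for whatever serves the long-lived stream to the kubelet. None of this exists in Kubernetes; `HealthCache`, `poll`, and the injected `clock` and `sleep` parameters are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional


class HealthCache:
    """Holds the most recent local check result and when it was taken."""

    def __init__(self, max_age_s: float) -> None:
        self.max_age_s = max_age_s
        self._last_ok: Optional[bool] = None
        self._last_at = 0.0

    def record(self, ok: bool, now: float) -> None:
        self._last_ok = ok
        self._last_at = now

    def healthy(self, now: float) -> bool:
        # Unknown or stale results are treated as unhealthy.
        if self._last_ok is None:
            return False
        return self._last_ok and (now - self._last_at) <= self.max_age_s


def poll(check: Callable[[], bool], cache: HealthCache, period_s: float,
         rounds: int,
         clock: Callable[[], float] = time.monotonic,
         sleep: Callable[[float], None] = time.sleep) -> None:
    """Run the local check `rounds` times, one every period_s seconds."""
    for _ in range(rounds):
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a failing check counts as unhealthy
        cache.record(ok, clock())
        sleep(period_s)


cache = HealthCache(max_age_s=0.5)
poll(lambda: True, cache, period_s=0.2, rounds=3)
print(cache.healthy(time.monotonic()))
```

The stream-facing side of the sidecar would only read `cache.healthy(...)`, so the sub-second traffic stays inside the pod, although the node still pays for the extra checks.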

SergeyKanzhelev commented 1 year ago

unfortunately it is not always possible to make the application compliant.

I understand this. I am worried about using Node network for subsecond probes. Maybe implementing the probes from the Pod's network or streaming can help with this.

SergeyKanzhelev commented 1 year ago

@mikebrow can you update the PR to indicate that you want it for 1.28.

SergeyKanzhelev commented 1 year ago

/label lead-opted-in

SergeyKanzhelev commented 1 year ago

/milestone v1.28

@mikebrow mentioned at the SIG Node meeting that he wants to see if it can make it into 1.28. Marking for the milestone to not lose it

salehsedghpour commented 1 year ago

Hello @mikebrow 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on 01:00 UTC Friday, 16th June 2023.

This enhancement is targeting stage alpha for 1.28 (correct me, if otherwise)

Here's where this enhancement currently stands:

[ ] KEP readme using the latest template has been merged into the k/enhancements repo.
[ ] KEP status is marked as implementable for latest-milestone: 1.28
[X] KEP readme has an updated detailed test plan section filled out
[ ] KEP readme has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would just need to update the following:

The KEP requires to include the updated readme template.
Address questions inside the Production Readiness Review Questionnaire.
Update the latest-milestone in kep.yaml file to 1.28
Update the status to implementable in kep.yaml file.
Update the graduation criteria in the readme.
Ensure that the PRs are merged.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

salehsedghpour commented 1 year ago

Hi @mikebrow 👋, just checking in before the enhancements freeze on 01:00 UTC Friday, 16th June 2023.

The status for this enhancement is at risk.

For this KEP, we would just need to update the following:

The KEP requires to include the updated readme template.
Address questions inside the Production Readiness Review Questionnaire.
Update the latest-milestone in kep.yaml file to 1.28
Update the status to implementable in kep.yaml file.
Update the graduation criteria in the readme.
Ensure that the PRs are merged.

Let me know if I missed anything. Thanks!

mikebrow commented 1 year ago

@salehsedghpour

Hi @mikebrow 👋, just checking in before the enhancements freeze on 01:00 UTC Friday, 16th June 2023.

The status for this enhancement is at risk.

For this KEP, we would just need to update the following:

The KEP requires to include the updated readme template.

done..

Address questions inside the Production Readiness Review Questionnaire.

done..

Update the latest-milestone in kep.yaml file to 1.28

done..

Update the status to implementable in kep.yaml file.

done needs approval..

Update the graduation criteria in the readme.

done needs approval..

Ensure that the PRs are merged.

thx wip..

Let me know if I missed anything. Thanks!

thank you, nothing noted, you were very thorough :-)

Atharva-Shinde commented 1 year ago

Hello @mikebrow 👋, 1.28 Enhancements Lead here. Unfortunately, this enhancement did not meet requirements for v1.28 enhancements freeze. Feel free to file an exception to add this back to the release tracking process. Thanks!

Atharva-Shinde commented 1 year ago

/milestone clear

SergeyKanzhelev commented 1 year ago

/milestone v1.29

(as discussed at SIG Node meeting this week)

mikebrow commented 1 year ago

updated kep to reflect milestone v1.29

salehsedghpour commented 1 year ago

Hello @mikebrow 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on Friday, 6th October 2023.

This enhancement is targeting stage alpha for 1.29 (correct me, if otherwise)

Here's where this enhancement currently stands:

[ ] KEP readme using the latest template has been merged into the k/enhancements repo.
[x] KEP status is marked as implementable for latest-milestone: 1.29.
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here).

For this KEP, we would just need to update the following:

The latest readme template has more items in the production readiness review questionnaire that need to be addressed.
Ensure that the PR including the production readiness review has been reviewed and merged into k/enhancements.

The status of this enhancement is marked as at risk for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

mikebrow commented 11 months ago

Hello @salehsedghpour, thx. I believe all the readme template issues are addressed in the update PR https://github.com/kubernetes/enhancements/pull/3067

for your convenience note this commit: https://github.com/kubernetes/enhancements/pull/3067/commits/7924ba213739250bfc575f2013a125fe645d3c9b

salehsedghpour commented 11 months ago

@mikebrow , thanks for the response. I just checked the readme template, I saw that in the latest readme template, there is this question that does not exist in https://github.com/kubernetes/enhancements/commit/7924ba213739250bfc575f2013a125fe645d3c9b. Please correct me if I'm wrong!

mikebrow commented 11 months ago

@mikebrow , thanks for the response. I just checked the readme template, I saw that in the latest readme template, there is this question that does not exist in 7924ba2. Please correct me if I'm wrong!

Nod that is almost the same question as asked right above..

[Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?](https://github.com/kubernetes/enhancements/issues/3066#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components)

[Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?](https://github.com/kubernetes/enhancements/issues/3066#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc)

same response

Enabling / using this feature will result in changes to resource usage
(CPU, RAM, disk, IO, `PIDs, sockets, inodes`...) in kubelet and runtime components. This KEP provides for
mitigation of the changes.

Reducing the probe period to subsecond intervals will result in probes polling slightly more
frequently until success, as mitigated for exec probes and by restricting to startup and readiness.

In a follow up KEP further mitigations and allowances may be considered based on resource
pressure, use cases for liveness probes, and if exec probe costs can be reduced via
architectural changes.

I can update if you like.. seems repetitive

salehsedghpour commented 11 months ago

Yes, you are right. I'll ask for more information about this and get back to you.

With that being said, the only thing left is ensuring the PR is being merged into k/enhancements.

salehsedghpour commented 11 months ago

Hi @mikebrow , checking in once more as we approach the 1.29 enhancement freeze deadline on 01:00 UTC, Friday, 6th October, 2023. The status of this enhancement is marked as at risk. It looks like https://github.com/kubernetes/enhancements/pull/3067 will address all of the requirements.

About the questionnaire, I'll bring the discussion up about those questions. And you don't need to update it for alpha stage.

Let me know if I missed anything. Thanks!

npolshakova commented 11 months ago

Hello 👋, 1.29 Enhancements Lead here. Unfortunately, this enhancement did not meet requirements for v1.29 enhancements freeze. Feel free to file an exception to add this back to the release tracking process. Thanks!

/milestone clear

Maxattax97 commented 10 months ago

Is this still a priority?

Granular probe timings are essential for modern applications that demand precision and high reliability. The current "second scale" limits Kubernetes' responsiveness. Implementing this feature would significantly enhance Kubernetes' capabilities for a wide range of use cases.

mikebrow commented 10 months ago

Yes it is still a priority.

salehsedghpour commented 8 months ago

/remove-label lead-opted-in

SergeyKanzhelev commented 7 months ago

/stage alpha
/milestone v1.30

salehsedghpour commented 7 months ago

Hello @mikebrow , 1.30 Enhancements team here! Is this enhancement targeting 1.30? If it is, can you follow the instructions here to opt in the enhancement and make sure the lead-opted-in label is set so it can get added to the tracking board? Thanks!

salehsedghpour commented 7 months ago

/milestone clear

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

dan-massie commented 2 months ago

Is this still being worked on?

Nibelheims commented 2 months ago

Is this still being worked on?

I hope so too, this KEP seemed to almost make it twice. Would like to help a little if possible and needed.

psschwei commented 2 months ago

/remove-lifecycle rotten
