smarterclayton commented 4 years ago

Enhancement Description

Administrators of Kube need a metric that correctly describes resource requests from pods and hides the complexities of the evolving pod resource model.
Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/1748-pod-resource-metrics/README.md
Primary contact (assignee): @smarterclayton
Responsible SIGs: sig-instrumentation, sig-node, sig-scheduling
Enhancement target (which target equals to which milestone):
- [x] Alpha release target (1.20)
- [x] Beta release target (1.21)
- [ ] Stable release target (1.27)
- [X] KEP (k/enhancements) update PR(s):
  - https://github.com/kubernetes/enhancements/pull/3810
- [X] Code (k/k) update PR(s):
  - https://github.com/kubernetes/kubernetes/pull/115454
- [ ] Docs (k/website) update(s):
  - https://github.com/kubernetes/website/pull/39998

smarterclayton commented 4 years ago

/sig instrumentation /sig node /sig scheduling

harshanarayana commented 4 years ago

Hey there @smarterclayton -- 1.19 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating in 1.19?

In order to have this part of the release:

The KEP PR must be merged in an implementable state
The KEP must have test plans
The KEP must have graduation criteria.

The current release schedule is:

Monday, April 13: Week 1 - Release cycle begins
Tuesday, May 19: Week 6 - Enhancements Freeze
Thursday, June 25: Week 11 - Code Freeze
Thursday, July 9: Week 14 - Docs must be completed and reviewed
Tuesday, August 4: Week 17 - Kubernetes v1.19.0 released

If you do, I'll add it to the 1.19 tracking sheet (http://bit.ly/k8s-1-19-enhancements). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍

Thanks!

smarterclayton commented 4 years ago

I don't think we'll make implementable and merged by Tuesday, so should be targeted for 1.20

harshanarayana commented 4 years ago

Hey @smarterclayton Thanks for confirming the inclusion state. I've marked the Enhancement as Deferred in the Tracker and updating the milestone accordingly.

/milestone v1.20

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

palnabarun commented 4 years ago

/remove-lifecycle stale

kikisdeliveryservice commented 4 years ago

Hi @smarterclayton !

Enhancements Lead here, do you still intend to target this for alpha in 1.20?

Thanks! Kirsten

smarterclayton commented 4 years ago

Yes, this is target alpha for 1.20 assuming we can close the remaining questions in the KEP

kikisdeliveryservice commented 4 years ago

Thanks Clayton!!

As a reminder, by Enhancements Freeze (October 6th), KEPs must be:

merged in an implementable state (yours is provisional)
must have test plans (missing)
must have graduation criteria (missing)

Best, Kirsten

I also added the PR link to the Issue description we can update again once merged.

mikejoh commented 4 years ago

Hi @smarterclayton :wave:!

I'm one of the Enhancement shadows for the 1.20 release cycle. This is a friendly reminder that the Enhancement freeze is on the 6th of October, i'm repeating the requirements needed by then:

The KEP must be merged in an implementable state.
- It's provisional at the moment and i see that there's active work ongoing.
The KEP must have test plans.
- This is missing at the moment.
The KEP must have graduation criteria(s).
- This is also missing at the moment.

Thanks!

smarterclayton commented 4 years ago

Thanks for the reminder, updated those. Will be working with the sig.

kikisdeliveryservice commented 4 years ago

The current PR looks complete from a enhancements freeze POV, we'll monitor to see if it merges in time.

kikisdeliveryservice commented 4 years ago

Hi @smarterclayton

Enhancements Freeze is now in effect. Unfortunately, your KEP PR has not yet merged. If you wish to be included in the 1.20 Release, please submit an Exception Request as soon as possible.

Best, Kirsten 1.20 Enhancements Lead

kikisdeliveryservice commented 4 years ago

An exception request was granted, so please ensure that all required changes are made and your PR is merged by the deadline listed in your request.

Thank! Kirsten

cc: @mikejoh

somtochiama commented 4 years ago

Hello @smarterclayton , 1.20 Docs shadow here 👋🏽. Does this enhancement work planned for 1.20 require any new docs or modification to existing docs?

If so, please follows the steps here to open a PR against dev-1.20 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Nov 6th

Also take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release. Thank you!

somtochiama commented 4 years ago

Hi @smarterclayton The docs placeholder deadline is almost here. Please make sure to create a placeholder PR against the dev-1.20 branch in the k/website before the deadline.

Also, please keep in mind the important upcoming dates:

Friday, Nov 6th: Week 8 - Docs Placeholder PR deadline
Thursday, Nov 12th: Week 9 - Code Freeze
Monday, Nov 23rd: Week 11 - Docs PR Ready for Review As a reminder, please link all of your k/k PR as well as docs PR to this issue for the release team to track.

Thank you.

smarterclayton commented 3 years ago

Docs PR is created

kikisdeliveryservice commented 3 years ago

Hey @smarterclayton

Does this KEP require both https://github.com/kubernetes/kubernetes/pull/94866 and https://github.com/kubernetes/kubernetes/pull/95839 or just the first?

Thanks! Kirsten

ehashman commented 3 years ago

/remove-sig node

ehashman commented 3 years ago

/milestone v1.21

ehashman commented 3 years ago

@smarterclayton can you please update your KEP to target beta for 1.21? There's a small metadata update and a PRR review that needs to happen I think.

smarterclayton commented 3 years ago

https://github.com/kubernetes/enhancements/pull/2417 is open, thanks for reminding

JamesLaverack commented 3 years ago

Hey @smarterclayton , enhancements 1.21 shadow here,

Enhancements Freeze is 2 days away, Feb 9th EOD PST

The enhancements team is aware that KEP update is currently in progress (PR #2417). Please make sure to work on PRR questionnaires and requirements and get it merged before the freeze. For PRR related questions or to boost the PR for PRR review, please reach out in Slack on the #prod-readiness channel.

Any enhancements that do not complete the following requirements by the freeze will require an exception.

[DONE] The KEP must be merged in an implementable state
[DONE] The KEP must have test plans
[DONE] The KEP must have graduation criteria
[IN PROGRESS] The KEP must have a production readiness review

annajung commented 3 years ago

With PR https://github.com/kubernetes/enhancements/pull/2417 merged, this enhancement meets all the criteria for the Enhancements freeze 👍

JamesLaverack commented 3 years ago

Hi @smarterclayton,

Since your Enhancement is scheduled to be in 1.21, please keep in mind the important upcoming dates:

Tuesday, March 9th: Week 9 — Code Freeze
Tuesday, March 16th: Week 10 — Docs Placeholder PR deadline
- If this enhancement requires new docs or modification to existing docs, please follow the steps in the Open a placeholder PR doc to open a PR against k/website repo.

As a reminder, please link all of your k/k PR(s) and k/website PR(s) to this issue so we can track them.

Thanks!

JamesLaverack commented 3 years ago

Hi @smarterclayton,

Enhancements team is marking this enhancement as "At Risk" for the upcoming code freeze due to not seeing any linked k/k PR(s) for this enhancement.

Please make sure to provide all k/k PR(s) and k/website PR(s) to this issue so it can be tracked by the release team.

ehashman commented 3 years ago

@JamesLaverack we've determined there are no code changes required for graduation to beta. We are updating documentation.

JamesLaverack commented 3 years ago

@ehashman Thank you for the clarification. I've now marked this enhancment as "Tracked" and done for 1.21.

ehashman commented 3 years ago

/milestone clear

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

ehashman commented 3 years ago

/remove-lifecycle rotten

smarterclayton commented 3 years ago

Canvasing the community to get feedback before GA promotion.

https://www.reddit.com/r/kubernetes/comments/pgn4sj/feedback_wanted_on_pod_resource_metrics_before_ga/

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ehashman commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Huang-Wei commented 2 years ago

@smarterclayton I just realized this metric has a pod label, which IMO increase the cardinality a lot and yield a pressure on the scraper side. Did you hear any concern/feedback from the users? Per the KEP, all the goals can be satisfied by removing the pod dimension as in terms of a metric, its primary goal is to give a high-level overview on aggregated pods' reqs/limits. Pod-level metric doesn't seem that common. WDYT?

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

logicalhan commented 2 years ago

/remove-lifecycle rotten

logicalhan commented 2 years ago

@smarterclayton any plans to graduate this to beta?

logicalhan commented 2 years ago

If there is no one working on this, we will have to deprecate and remove this stuff. Alternatively, we will need to find someone to graduate this.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dashpole commented 2 years ago

/lifecycle frozen I think we might need to look for a new owner to drive this work in 1.26

logicalhan commented 2 years ago

/lifecycle frozen

Huang-Wei commented 2 years ago

@dashpole @logicalhan do you happen to find some volunteers to continue the work?

smarterclayton commented 2 years ago

Oh man, we didn't take this to beta?! This is my fault. Let me talk to @dgrisonnet who pinged me about it a day ago - originally the delay was gathering feedback from admins doing capacity planning, and I had been working with a few people on leveraging it more widely.

The use I was most familiar with was OpenShift and we replaced the dashboards that were using the (old, incorrect, not complete) kube-state-metrics for this - among the folks who did the change there was general agreement that the new metrics were superior and the cost of cardinality was worth it to replace the generally incorrect metrics from kube-state-metrics (at the time we felt that completely replicating the pod resource model code in ksm was not appropriate, and this was a better solution).. Next phase was getting community user input on building metric based capacity dashboards and whether the dimensions worked for the audience. I did a few analysis when planning out e2e CI runs and found the metrics provided better human visibility of comparing bulk "used vs requested".

smarterclayton commented 2 years ago

@Huang-Wei re:

I just realized this metric has a pod label, which IMO increase the cardinality a lot and yield a pressure on the scraper side. Did you hear any concern/feedback from the users? Per the KEP, all the goals can be satisfied by removing the pod dimension as in terms of a metric, its primary goal is to give a high-level overview on aggregated pods' reqs/limits. Pod-level metric doesn't seem that common. WDYT?

The original intent was to allow admins to build capacity planning dashboards, and to pair the resource vs pod level resource metrics (like cpu, memory, etc). So the intent was very much to have a pod dimension. Do we have a proposal to remove or make optional pod level cpu consumption or memory consumption dimensions? If so, such a change would apply to this metric as well, but as this is already an optional endpoint for users who are concerned about cardinality.

smarterclayton commented 2 years ago

To clarify - this is in beta since 1.21 (https://github.com/kubernetes/enhancements/issues/1748#issuecomment-791052241). Was there some belief that it was not beta?

It would be last step to go to GA, I'm happy to push that over the line with @dgrisonnet

dgrisonnet commented 2 years ago

I also thought this was still in Alpha for some reason even though we have a label marking the stability :sweat_smile:

Yet let's try to get this over the finish and gather some feedback from users to know if they encountered any issues with these new metrics.

/assign @smarterclayton @dgrisonnet

dgrisonnet commented 2 years ago

Do we have a proposal to remove or make optional pod level cpu consumption or memory consumption dimensions? If so, such a change would apply to this metric as well, but as this is already an optional endpoint for users who are concerned about cardinality.

We already have a cardinality protection mechanism in Kubernetes: https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2305-metrics-cardinality-enforcement so users could already tweak the dimensions if needed. That, plus the fact that the endpoint is optional, it sounds fairly safe to expose without having to worry about potential cardinality explosions.

kubernetes / enhancements

Expose metrics about resource requests and limits that represent the pod model #1748

Enhancement Description