knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Provide configuration option for revisions becoming Unschedulable #14862

Open SaschaSchwarze0 opened 9 months ago

SaschaSchwarze0 commented 9 months ago

/area autoscale

Describe the feature

In Bubble up pod schedule errors to revision status, the Revision reconciler was changed to propagate pod scheduling issues up to the Revision. This is done whenever a revision scales up from 0, but not when it scales up from, for example, 1, due to this condition.
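
For illustration, the bubbled-up state looks roughly like this (the reason comes from the kube-scheduler; the message text is just a typical example, and the exact Revision condition wording may differ):

```yaml
# On the Pod, kube-scheduler reports the failure:
status:
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    message: "0/3 nodes are available: 3 Insufficient cpu."
---
# Roughly what the Revision then shows after the reconciler propagates it:
status:
  conditions:
  - type: Ready
    status: "False"
    reason: Unschedulable
    message: "0/3 nodes are available: 3 Insufficient cpu."
```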

I can generally understand the reasoning behind this for a Knative installation where users have full control and the cluster size is fixed.

We are running Knative as a managed service with cluster autoscaling. There, it is effectively impossible to end up with a Pod that permanently cannot be scheduled: even if no capacity is available for a moment, every Pod will eventually be scheduled. In our environment, revisions (temporarily) going into that status confuse our users.

What I would like to ask for is a configuration option to turn that code path off: when the flag is active and the condition's reason is Unschedulable, the scheduling issue is not propagated to the Revision.

If you agree that such a flag makes sense, I would be willing to PR the change. I would just need guidance on how to name the configuration option (pod-is-always-schedulable, for example?) and on whether it should go into config-features or whether you would prefer an environment variable.
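
To make this concrete, a rough sketch of how the flag could look in the config-features ConfigMap (the key name is just my example from above, not a settled name):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  # Hypothetical flag name, following the enabled/disabled value
  # convention of the other config-features flags.
  pod-is-always-schedulable: "enabled"
```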

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

SaschaSchwarze0 commented 4 months ago

/remove-lifecycle stale

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

SaschaSchwarze0 commented 1 month ago

/remove-lifecycle stale

As long as there is no feedback from the Knative community at all, one could consider it offensive for the issue to be marked stale. :-)

skonto commented 3 weeks ago

Hi @SaschaSchwarze0!

In our environment, revisions (temporarily) going into that status are confusing our users.

Does that mean that users see the actual revision and get confused? Could the managed service communicate that this is temporary, as a workaround? If users see the revision resource, I understand that hiding the temporary error is friendlier, I guess, because otherwise you are leaking that you're just provisioning resources rather than having them always available.

SaschaSchwarze0 commented 3 weeks ago

Hi @skonto, we have an indicator for the readiness of a revision (green for ready=true, yellow for ready=unknown, red for ready=false). That's where users get worried, because anything that is not green looks like a problem.

And yes, we could probably detect that the revision is not ready only because Knative assumes the pods cannot be scheduled, and still show it as green. But that would only be something our UX puts on top. If users go down to the Kubernetes API and look at the revision, they would still see it as not ready.