Closed: Shashankft9 closed this issue 3 years ago
This is indeed a hole in our readiness state. A revision that was once considered Ready will currently never become unready.
Do you think this known hole also covers the behavior I noticed in v0.14.0, where the status never reports an unready state when the first request arrives on a ksvc that has scaled down to zero? I can also try v0.18.0 to verify; I remember Dave wanted to test this on v0.18.0 here: https://gist.github.com/dprotaso/56c1fd920291eff29cce48f0501732a0
ref: https://knative.slack.com/archives/CA4DNJ9A4/p1606458112460300
Yep, same with v0.18.0, just tested the flow. The status remains `True`.
I'm wondering how legitimate a concern this is, for a few reasons:
A `panic` in Go is generally handled by the HTTP server fairly gracefully, IIRC; the term I have used for requests that actually kill the server is a "query of death" (see the sketch below). I think there are two scenarios worth spelling out:
I think that if we can manage 2., then bad changes rolled out via `latestRevision: true` (with or without the new rollout work @vagababov has been doing) could theoretically be rolled back, which would be a nice protection to offer folks, but we need to work out the details very carefully.
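A minimal sketch of that distinction, assuming a plain `net/http` app (nothing here is taken from the issue itself): a panic inside a handler is recovered per-request by the server, while anything that exits the process takes the whole user-container down.

```go
package main

import (
	"log"
	"net/http"
	"os"
)

func main() {
	// A panic in a handler is recovered per-request by net/http: the client
	// sees a failed request, but the server process keeps running.
	http.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
		panic("boom")
	})
	// Exiting the process is not recoverable: the container dies and the
	// pod goes into Error, which is the "query of death" case.
	http.HandleFunc("/exit", func(w http.ResponseWriter, r *http.Request) {
		os.Exit(1)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```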
cc @dprotaso
slack thread discussion: https://knative.slack.com/archives/CA4DNJ9A4/p1606458112460300
/area API
Is there more that needs to be done here? @dprotaso, can you put a priority/hint here?
/triage needs-user-input
Hey, I just noticed the label. Is there any more info I could provide?
I think Evan was looking for input from me. I'll surface what I mentioned in Slack.
We can't automatically distinguish when a runtime failure should gate readiness of the revision. We need hints from the user via readiness/liveness probes.
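As a rough illustration of the kind of hint meant here (a sketch, assuming a plain Go HTTP app; the endpoint and helper names are made up): the app exposes a `/healthz` endpoint for a `readinessProbe` and starts failing it once it knows it can no longer serve, which is what would let the revision be gated on readiness.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// unhealthy is flipped once the app decides it can no longer serve traffic.
var unhealthy atomic.Bool

func main() {
	// Point the container's readinessProbe at /healthz; once it starts
	// returning 503, the pod can be marked unready.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if unhealthy.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if err := serve(r); err != nil {
			unhealthy.Store(true) // subsequent probes will fail
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// serve stands in for the real request handling.
func serve(r *http.Request) error { return nil }
```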
Matt's scenario 2), in my opinion, should be managed by some higher-level continuous delivery tool. That would capture not only runtime errors (counting HTTP status codes) but also performance regressions (changes in request latency).
I'm going to close this issue out and recommend looking at Spinnaker and other CD tools.
I made a follow-up issue to confirm that, if we have the right hints (readiness/liveness probes failing), we behave properly.
What version of Knative?
I have tested the behavior in v0.14.0 and v0.19.0
Expected Behavior
When a revision fails and the pod shows an `Error` state, due to a 502 Bad Gateway response for example, this state should be captured in the logs of the controller pod (https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L111) and similarly carried forward to the status of the Revision, Configuration, and ksvc CRs (https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L113), for example.

Actual Behavior
I will distinguish the behavior in the two versions I have tested:
v0.14.0 - Here the logs and the status are captured correctly, but the problem is that this happens only once the queue-proxy container is ready (I am doubtful whether this is the actual reason, but on observation it appears so). When the ksvc instance has scaled down to zero and the first request goes in, even though the response might be a 502 and the pod will go into `Error`, it won't report the log or the status of the CRs (`RevisionFailed`); the pod goes from `0/2` to `1/2` and then into `Error`. But after some seconds, when the pod goes back to `2/2` with a new `user-container` (note: the ksvc hasn't scaled down to zero) and I send a request, the pod goes into `Error` and both the logs and the status of the CRs are captured correctly. So essentially, it won't report the failure the first time, not until the pod has reached the `2/2` stage for the first time.

v0.19.0 - Here even the above behavior is absent; I can't see the relevant log, and the status stays `True` for all the CRs at all times.

Steps to Reproduce the Problem
The failure happens during unmarshalling, because the variable `result` has a wrong data type.
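A hypothetical sketch along these lines (only the name `result` comes from the description above; everything else is assumed) shows the shape of the failure: the request body can't be unmarshalled into `result` because the types don't match, and the process exits on the error, so the request gets a 502 and the pod lands in `Error`.

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Wrong data type: the payload is a JSON object, but result is an
		// int, so unmarshalling always fails for object payloads.
		var result int
		if err := json.Unmarshal(body, &result); err != nil {
			// Exiting here kills the server mid-request, which is what
			// surfaces as a 502 and an Error pod.
			log.Fatalf("unmarshalling into result failed: %v", err)
		}
		_, _ = w.Write([]byte("ok"))
	})

	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```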