Open ShiningTian opened 3 years ago
@ShiningTian and I work in the same team, so we have similar observations on the knative creation failures.
Besides the mid-reconcile failures due to network issues mentioned above, I also reported that a limitrange sometimes wasn't picked up by the revision controller, so it kept waiting for the pod to become ready and the initial scale was never achieved. I would like to propose working on a solution that considers all 3 of these issues together to improve the user experience.
Below is my proposal for the serving side changes:
From the serving side, given that reconciliation always retries after a failure, I propose tolerating these transient errors and not exposing them in the ksvc CR's status.ready.
That means, for the network error mentioned above and the pod scheduling issue in #10076, the ksvc controller would collect the reconcile failure and record the error on the Revision/Route CRs,
BUT the ksvc controller would NOT set the ksvc CR's status.ready to false.
As a result, the ksvc CR would only be True, or Unknown if the error cannot recover before the deadline is exceeded.
If the final state is Unknown, the end-user can check the Route or Revision CRs to understand what exactly went wrong, which would address the situation in #9857 as well.
If this is not acceptable on the serving side, then we need to discuss it for the kn side.
Also, on the knative client side, I reported that the CLI fails whenever ksvc ready==false is detected. @julz @markusthoemmes @mattmoor ^^ :-)
I think the assumption here is that it is an unrecoverable failure: https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76
cc @dprotaso
/area API
given the controllers will do retry, so the knative service itself becomes "ready=true" in the end.
There are currently some gaps in our failure modes, as you pointed out (https://github.com/knative/serving/issues/9857, https://github.com/knative/serving/issues/10076), where we don't report certain failures and revisions end up being marked ready=True only when they are scaled to 0. I've been poking at this problem prior to the break.
The scenarios you highlighted are a bit different:
'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook \"webhook.serving.knative.dev\": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded" failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded
Some of the errors you highlighted are failures with the K8s API server invoking webhooks. Was there anything abnormal with your knative & istio webhook deployments?
From the serving side, given the reconcile will always happen when failure happens, I proposed to tolerate the in-middle errors, and not expose them on the ksvc CR status.ready. ... I think the assumption here is that it is an unrecoverable failure:
We don't annotate creation errors as permanent, but we should, depending on the status: i.e. mark the failure permanent when the API returns a 4xx, and potentially retry without marking the failure when a 5xx occurs.
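The suggestion above could be sketched as a small classifier (a hypothetical helper, not the actual Knative code): a 4xx from the API server means the request itself is bad and retrying will not help, while a 5xx (e.g. a webhook timing out behind the API server, as in the errors reported here) is likely transient and retryable without marking the resource as failed.

```go
package main

import "fmt"

// isPermanent reports whether a creation failure should be treated as
// unrecoverable, based only on the HTTP status code returned by the API
// server: 4xx is permanent, everything else (notably 5xx) is transient.
func isPermanent(statusCode int) bool {
	return statusCode >= 400 && statusCode < 500
}

func main() {
	fmt.Println(isPermanent(422)) // invalid object: permanent, fail fast
	fmt.Println(isPermanent(500)) // webhook call failed: transient, retry
	fmt.Println(isPermanent(503)) // API server unavailable: transient, retry
}
```

In the real controller one would inspect the typed API error rather than a bare status code, but the split is the same: only the permanent class should flip the resource's Ready condition to False.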
@dprotaso, yes, I agree that there are retries, and I like the retries in the current design, which make the ksvc eventually become ready.
I just wonder whether we can avoid reporting the reconcile failure, since these are not permanent errors.
I think the assumption here is that it is an unrecoverable failure:
cc @dprotaso
It looks like the request here would be to support a few rounds of retries before failing this (?)
/good-first-issue
/triage accepted
@evankanderson: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue
command.
I would like to contribute to this. I will read through more of the code logic and discuss the solution before going further.
When you're ready, you can /assign
to assign the issue to yourself to let people know you're working on it.
Hello, we are experiencing the same kind of issue on Google Cloud Run. Do you have a workaround until a fix is found? I would be glad to help but I'm not sure where to start nor what to do :/ Thank you for your help 🙏
@cdlliuy Hi! are you planning to work on this?
Hi @ShiningTian @evankanderson! Can I claim this issue if no one else is working on it?
The /assign
command on is own line in a comment will assign the bug to the person making the comment. (You can also /assign <user>
if you want.)
This bug looks open for you to tackle.
/assign
Thanks @evankanderson!
@yuvaldolev gentle ping, any progress on this?
@yuvaldolev need any help to push this one forward?
/unassign @yuvaldolev due to inactivity
/assign
I can't figure out what I have to do, so I am unassigning myself.
/unassign
/assign
In what area(s)?
What version of Knative?
Expected Behavior
What we observed
We observed multiple intermittent knative reconcile failures, that is, the controller hit some problems mid-reconcile and marked the knative service status as "Failed". But given that the controllers retry, the knative service itself becomes "ready=true" in the end.
Anyway, these intermittent errors cause a lot of confusion for the end-user when using the cli tool with the default "wait" approach, since the cli tool reports a failure whenever "ready=false" is detected (as reported in cli issue https://github.com/knative/client/issues/1023 as well).
Below are the mid-reconcile failure issues. We can look at the details and then discuss a solution:
Intermittent error 1
Error Msg from kn:
The key error detected in the reconcile is the failure when posting to the webhook.
After we got the error from kn, if we wait for a while, the final status is ready.

Intermittent error 2
Error Msg:
As reported in case 1, we observed the webhook connection failure in the logs as well.
After we got the error from kn, if we wait for a while, the final status is ready.

Intermittent error 3
Error Msg:
Below are the related logs. I can't tell the exact error, but it still looks like a network issue:
After we got the above error from kn, if we wait for a while, the final status is ready.

Summary
All 3 of these errors are related to intermittent network connection errors. In a cloud environment, it is in the nature of things that the network is not reliable all the time.
Given that knative retries to tolerate the network issues, can we work out a better approach, on both the serving side and the client cmd, to make these transient creation failures invisible to the end-user?