knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Knative reconcile fails in-middle, but be `ready` eventually for service creation #10511

Open ShiningTian opened 3 years ago

ShiningTian commented 3 years ago

In what area(s)?

/area API

What version of Knative?

0.18.x 0.19.x

Expected Behavior

What we observed

We observed multiple intermittent Knative reconcile failures: the controller hit a problem mid-reconcile and marked the Knative service status as "Failed". But because the controllers retry, the Knative service itself becomes "ready=true" in the end.

These intermittent errors cause a lot of confusion for end-users of the cli tool with the default "wait" approach, since the cli tool reports a failure whenever "ready=false" is detected (as reported in the cli issue https://github.com/knative/client/issues/1023 as well).

Below are the mid-reconcile failures we hit. Let's look at the details and then discuss a solution:

Intermittent error 1
Intermittent error 2
failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded
Intermittent error 3

Summary

All 3 of these errors are related to intermittent network connection errors. In a cloud environment, it is in the nature of things that the network is not reliable all the time.

Given that Knative retries to tolerate network issues, can we come up with a better approach, on both the serving side and the client cli side, so that mid-reconcile creation failures stay invisible to the end-user?

cdlliuy commented 3 years ago

@ShiningTian and I work on the same team, so we have similar observations about the Knative creation failures.

Besides the mid-reconcile failures due to network issues mentioned above, I also reported

On the other hand, I also reported

I would like to propose working on a solution that considers all 3 of these issues together to improve the user experience.

Below is my proposal for the serving side changes:

From the serving side, given that a re-reconcile will always happen when a failure occurs, I propose to tolerate mid-reconcile errors and not expose them in the ksvc CR's status.ready.

That means that for the network errors mentioned above and the pod scheduling issue in #10076, the ksvc controller will collect the reconcile failure and record the error on the Revision/Route CRs,
BUT the ksvc controller won't set the ksvc CR's status.ready to false.
As a result, the ksvc CR will only ever be True, or Unknown if the error can't recover before the deadline is exceeded. If the final state is Unknown, the end-user can check the Route or Revision CRs to understand what exactly went wrong, which addresses the situation in #9857 as well.

If that is not acceptable on the serving side, then we need to discuss it on the kn side. Also, on the Knative client side, I reported

cdlliuy commented 3 years ago

@julz @markusthoemmes @mattmoor ^^ :-)

mattmoor commented 3 years ago

I think the assumption here is that it is an unrecoverable failure: https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

dprotaso commented 3 years ago

/area API

given the controllers will do retry, so the knative service itself becomes "ready=true" in the end.

There are currently some gaps in our failure modes, as you pointed out (https://github.com/knative/serving/issues/9857, https://github.com/knative/serving/issues/10076), where we don't report certain failures and revisions end up being marked 'ready=True' only when they are scaled to 0. I've been poking at this problem since prior to the break.

The scenarios you highlighted are a bit different

'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook "webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded

failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded

Some of the errors you highlighted are failures of the K8s API invoking webhooks. Was there anything abnormal with your Knative & Istio webhook deployments?

From the serving side, given the reconcile will always happen when failure happens, I proposed to tolerate the in-middle errors, and not expose them on the ksvc CR status.ready. ... I think the assumption here is that it is an unrecoverable failure:

We don't annotate creation errors as permanent, but we should, depending on the status: i.e. mark them permanent if the API returns a 4xx, and potentially retry without marking the failure when a 5xx occurs.
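That suggestion amounts to a classifier over the API response code: client errors (4xx) mean the request itself is wrong and retrying is pointless, while server errors (5xx, which is how webhook timeouts surface from the API server) stay retryable. A hedged, stdlib-only illustration of the idea, not the actual Knative code (which would inspect k8s.io/apimachinery status errors rather than raw HTTP codes):

```go
package main

import (
	"fmt"
	"net/http"
)

// classify decides how a reconciler could treat a failed API call,
// based on the HTTP status code returned by the Kubernetes API server:
//   - 4xx: the request is invalid; mark the failure permanent and
//     surface it on the resource status.
//   - 5xx: likely transient (e.g. a webhook deadline exceeded surfacing
//     as an InternalError); requeue and retry without flipping Ready.
func classify(statusCode int) string {
	switch {
	case statusCode >= 400 && statusCode < 500:
		return "permanent"
	case statusCode >= 500:
		return "retryable"
	default:
		return "success"
	}
}

func main() {
	fmt.Println(classify(http.StatusUnprocessableEntity)) // permanent
	fmt.Println(classify(http.StatusInternalServerError)) // retryable
	fmt.Println(classify(http.StatusOK))                  // success
}
```

Under this split, both webhook-timeout errors quoted above would land in the retryable bucket and never reach the ksvc's Ready condition.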

cdlliuy commented 3 years ago

@dprotaso, yes, I agree that there are retries, and I like the retries in the current design, which make the ksvc eventually become ready.

I just wonder whether we can avoid reporting the reconcile failure, since these are not permanent errors.

evankanderson commented 3 years ago

I think the assumption here is that it is an unrecoverable failure:

https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

It looks like the request here would be to support a few rounds of retries before failing this (?)

/good-first-issue /triage accepted

knative-prow-robot commented 3 years ago

@evankanderson: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

cdlliuy commented 3 years ago

I would like to contribute to this. I will read through more of the code logic and discuss a solution before going further.

evankanderson commented 3 years ago

When you're ready, you can /assign to assign the issue to yourself to let people know you're working on it.

benja-M-1 commented 3 years ago

Hello, we are experiencing the same kind of issue on Google Cloud Run. Do you have a workaround until a fix is found? I would be glad to help, but I'm not sure where to start nor what to do :/ Thank you for your help πŸ™πŸ»

skonto commented 3 years ago

@cdlliuy Hi! Are you planning to work on this?

yuvaldolev commented 3 years ago

Hi @ShiningTian @evankanderson! Can I claim this issue if no one else is working on it?

evankanderson commented 3 years ago

The /assign command on its own line in a comment will assign the bug to the person making the comment. (You can also /assign <user> if you want.)

This bug looks open for you to tackle. πŸ˜‰

yuvaldolev commented 3 years ago

/assign

yuvaldolev commented 3 years ago

Thanks @evankanderson!

skonto commented 2 years ago

@yuvaldolev gentle ping, any progress on this?

skonto commented 2 years ago

@yuvaldolev need any help to push this one forward?

dprotaso commented 1 year ago

/unassign @yuvaldolev due to inactivity

vishal-chdhry commented 1 year ago

/assign

vishal-chdhry commented 1 year ago

I can't figure out what I have to do, so I am unassigning myself.

vishal-chdhry commented 1 year ago

/unassign

maryfrances01 commented 1 year ago

/assign