knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Knative reconcile fails in-middle, but be `ready` eventually for service creation #10511

Open ShiningTian opened 3 years ago

ShiningTian commented 3 years ago

In what area(s)?

/area API

What version of Knative?

0.18.x 0.19.x

Expected Behavior

What we observed

We observed multiple intermittent Knative reconcile failures: the controller hit a problem mid-reconcile and marked the Knative service status as "Failed". But because the controllers retry, the Knative service itself becomes "ready=true" in the end.

These intermittent errors cause a lot of confusion for end-users of the cli tool with the default "wait" approach, since the cli tool reports a failure whenever "ready=false" is detected (as reported in the cli issue https://github.com/knative/client/issues/1023 as well).

Below are the mid-reconcile failures we hit. Let's look at the details and then discuss a solution:

Intermittent error 1
Intermittent error 2
failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded
Intermittent error 3

Summary

All 3 of these errors are related to intermittent network connection errors. In a cloud environment, it is in the nature of things that the network is not reliable all the time.

Given that Knative retries to tolerate network issues, can we come up with a better approach, on both the serving side and the client cli side, so that mid-reconcile creation failures stay invisible to the end-user?

cdlliuy commented 3 years ago

@ShiningTian and I work on the same team, so we have similar observations about the Knative creation failures.

Besides the mid-reconcile failures due to network issues mentioned above, I also reported

On the other hand, I also reported

I would like to propose working on a solution that considers all 3 of these issues together to improve the user experience.

Below is my proposal for the serving side changes:

From the serving side, given that a re-reconcile will always happen when a failure occurs, I propose to tolerate mid-reconcile errors and not expose them in the ksvc CR's status.ready.

That means that for the network errors mentioned above and the pod scheduling issue in #10076, the ksvc controller will collect the reconcile failure and record the error on the Revision/Route CRs,
BUT the ksvc controller won't set the ksvc CR's status.ready to false.
As a result, the ksvc CR will only ever be True, or Unknown if the error can't recover before the deadline is exceeded. If the final state is Unknown, the end-user can check the Route or Revision CRs to understand what exactly went wrong, which addresses the situation in #9857 as well.

If that is not acceptable on the serving side, then we need to discuss it on the kn side. Also, on the Knative client side, I reported

cdlliuy commented 3 years ago

@julz @markusthoemmes @mattmoor ^^ :-)

mattmoor commented 3 years ago

I think the assumption here is that it is an unrecoverable failure: https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

dprotaso commented 3 years ago

/area API

given the controllers will do retry, so the knative service itself becomes "ready=true" in the end.

There are currently some gaps in our failure modes, as you pointed out (https://github.com/knative/serving/issues/9857, https://github.com/knative/serving/issues/10076), where we don't report certain failures and revisions end up being marked 'ready=True' only when they are scaled to 0. I've been poking at this problem since prior to the break.

The scenarios you highlighted are a bit different

'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook "webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded

failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded

Some of the errors you highlighted are failures of the K8s API invoking webhooks. Was there anything abnormal with your Knative & Istio webhook deployments?

From the serving side, given the reconcile will always happen when failure happens, I proposed to tolerate the in-middle errors, and not expose them on the ksvc CR status.ready. ... I think the assumption here is that it is an unrecoverable failure:

We don't annotate creation errors as permanent, but we should, depending on the status: i.e. mark them permanent if the API returns a 4xx, and potentially retry without marking the failure when a 5xx occurs.
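That suggestion amounts to a classifier over the API response code: client errors (4xx) mean the request itself is wrong and retrying is pointless, while server errors (5xx, which is how webhook timeouts surface from the API server) stay retryable. A hedged, stdlib-only illustration of the idea, not the actual Knative code (which would inspect k8s.io/apimachinery status errors rather than raw HTTP codes):

```go
package main

import (
	"fmt"
	"net/http"
)

// classify decides how a reconciler could treat a failed API call,
// based on the HTTP status code returned by the Kubernetes API server:
//   - 4xx: the request is invalid; mark the failure permanent and
//     surface it on the resource status.
//   - 5xx: likely transient (e.g. a webhook deadline exceeded surfacing
//     as an InternalError); requeue and retry without flipping Ready.
func classify(statusCode int) string {
	switch {
	case statusCode >= 400 && statusCode < 500:
		return "permanent"
	case statusCode >= 500:
		return "retryable"
	default:
		return "success"
	}
}

func main() {
	fmt.Println(classify(http.StatusUnprocessableEntity)) // permanent
	fmt.Println(classify(http.StatusInternalServerError)) // retryable
	fmt.Println(classify(http.StatusOK))                  // success
}
```

Under this split, both webhook-timeout errors quoted above would land in the retryable bucket and never reach the ksvc's Ready condition.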

cdlliuy commented 3 years ago

@dprotaso, yes, I agree that there are retries, and I like the retries in the current design, which make the ksvc eventually become ready.

I just wonder whether we can avoid reporting the reconcile failure, since these are not permanent errors.

evankanderson commented 3 years ago

I think the assumption here is that it is an unrecoverable failure:

https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

It looks like the request here would be to support a few rounds of retries before failing this (?)

/good-first-issue /triage accepted

knative-prow-robot commented 3 years ago

@evankanderson: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

cdlliuy commented 3 years ago

I would like to contribute to this. I will read through more of the code logic and discuss a solution before going further.

evankanderson commented 3 years ago

When you're ready, you can /assign to assign the issue to yourself to let people know you're working on it.

benja-M-1 commented 3 years ago

Hello, we are experiencing the same kind of issue on Google Cloud Run. Do you have a workaround until a fix is found? I would be glad to help, but I'm not sure where to start nor what to do :/ Thank you for your help πŸ™πŸ»

skonto commented 3 years ago

@cdlliuy Hi! Are you planning to work on this?

yuvaldolev commented 3 years ago

Hi @ShiningTian @evankanderson! Can I claim this issue if no one else is working on it?

evankanderson commented 3 years ago

The /assign command on its own line in a comment will assign the bug to the person making the comment. (You can also /assign <user> if you want.)

This bug looks open for you to tackle. πŸ˜‰

yuvaldolev commented 3 years ago

/assign

yuvaldolev commented 3 years ago

Thanks @evankanderson!

skonto commented 2 years ago

@yuvaldolev gentle ping, any progress on this?

skonto commented 2 years ago

@yuvaldolev need any help to push this one forward?

dprotaso commented 1 year ago

/unassign @yuvaldolev due to inactivity

vishal-chdhry commented 1 year ago

/assign

vishal-chdhry commented 1 year ago

I can't figure out what I have to do, so I am unassigning myself.

vishal-chdhry commented 1 year ago

/unassign

maryfrances01 commented 1 year ago

/assign