knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Revision gets stuck in ResolvingDigests step #10259

Closed skonto closed 3 years ago

skonto commented 3 years ago

What version of Knative?

0.19.0 (release)

Expected Behavior

Install a basic service and Revision should be ready.

Actual Behavior

Revision gets stuck:

Name: event-display-00001
Namespace: default
Labels: serving.knative.dev/configuration=event-display
serving.knative.dev/configurationGeneration=1
serving.knative.dev/routingState=active
serving.knative.dev/service=event-display
Annotations: serving.knative.dev/creator: system:admin
serving.knative.dev/routes: event-display
serving.knative.dev/routingStateModified: 2020-12-02T16:01:15Z
API Version: serving.knative.dev/v1
Kind: Revision
Metadata:
Creation Timestamp: 2020-12-02T16:01:15Z
Generation: 1
Managed Fields:
API Version: serving.knative.dev/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:serving.knative.dev/creator:
f:serving.knative.dev/routes:
f:serving.knative.dev/routingStateModified:
f:labels:
.:
f:serving.knative.dev/configuration:
f:serving.knative.dev/configurationGeneration:
f:serving.knative.dev/routingState:
f:serving.knative.dev/service:
f:ownerReferences:
.:
k:{"uid":"80261e6a-b302-41db-98a9-b0cef3f5b187"}:
.:
f:apiVersion:
f:blockOwnerDeletion:
f:controller:
f:kind:
f:name:
f:uid:
f:spec:
.:
f:containerConcurrency:
f:containers:
f:enableServiceLinks:
f:timeoutSeconds:
f:status:
.:
f:conditions:
f:observedGeneration:
Manager: controller
Operation: Update
Time: 2020-12-02T16:01:15Z
Owner References:
API Version: serving.knative.dev/v1
Block Owner Deletion: true
Controller: true
Kind: Configuration
Name: event-display
UID: 80261e6a-b302-41db-98a9-b0cef3f5b187
Resource Version: 46845
Self Link: /apis/serving.knative.dev/v1/namespaces/default/revisions/event-display-00001
UID: b7ae07f2-7f50-47ec-bd5b-0c2757cab632
Spec:
Container Concurrency: 0
Containers:
Image: quay.io/openshift-knative/knative-eventing-sources-event-display
Name: user-container
Readiness Probe:
Success Threshold: 1
Tcp Socket:
Port: 0
Resources:
Enable Service Links: false
Timeout Seconds: 300
Status:
Conditions:
Last Transition Time: 2020-12-02T16:01:15Z
Status: Unknown
Type: ContainerHealthy
Last Transition Time: 2020-12-02T16:01:15Z
Reason: ResolvingDigests
Status: Unknown
Type: Ready
Last Transition Time: 2020-12-02T16:01:15Z
Reason: ResolvingDigests
Status: Unknown
Type: ResourcesAvailable
Observed Generation: 1
Events: <none>

Steps to Reproduce the Problem

  1. A cluster where the GCE metadata server is not reachable, e.g. due to DNS issues or some other infrastructure problem; the exact cause is irrelevant, as infra issues happen all the time. I am not sure how to reproduce that infra problem on demand, but in my case it is an internally created OCP 4.6.6 cluster on top of GCE.
  2. Install Serving from 0.19.0 release and Kourier.
  3. Install a basic service, e.g.:
    
    # This is a very simple Knative Service that writes the input request to its log.
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: event-display
    spec:
      template:
        spec:
          containers:
            - # This corresponds to
              # https://github.com/knative/eventing-contrib/tree/master/cmd/event_display/main.go
              image: gcr.io/knative-releases/knative.dev/eventing-contrib/cmd/event_display

I have debugged the issue and the problem is that image digest resolution gets stuck.

Here is the story (I added some log statements to track execution progress; a bit ugly, sorry for that):

The Serving controller log is here (https://gist.github.com/skonto/d7c3753cfdb711bf87ce19ba0a3c166a); it never gets past this point (https://github.com/skonto/serving/blob/debug-stuck-r/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/gcp/metadata.go#L218).

In detail, here is what happens. Revision digest resolution calls `kc, err := k8schain.New(ctx, r.client, opt)` (https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/pkg/reconciler/revision/resolve.go#L81)

which calls `keyring = credentialprovider.NewDockerKeyring()` (https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/vendor/github.com/google/go-containerregistry/pkg/authn/k8schain/k8schain.go#L104)

and blocks while checking whether the GCP provider is enabled (https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/plugins.go#L65).

The problem is that this blocking behavior is documented but we missed it (https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/provider.go#L28):

    // DockerConfigProvider is the interface that registered extensions implement
    // to materialize 'dockercfg' credentials.
    type DockerConfigProvider interface {
        // Enabled returns true if the config provider is enabled.
        // Implementations can be blocking - e.g. metadata server unavailable.
        Enabled() bool
        // Provide returns docker configuration.
        // Implementations can be blocking - e.g. metadata server unavailable.
        // The image is passed in as context in the event that the
        // implementation depends on information in the image name to return
        // credentials; implementations are safe to ignore the image.
        Provide(image string) DockerConfig
    }

So the GCP provider implements Enabled() (https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/gcp/metadata.go#L212)

which calls `runWithBackoff(func() ([]byte, error) { ... })` (https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/gcp/metadata.go#L218)

Unfortunately this blocks forever, as is also stated in the implementation:

// runWithBackoff runs input function `f` with an exponential backoff.
// Note that this method can block indefinitely.

(https://github.com/skonto/serving/blob/70ce722d1b2a70de690b8059a9429c823fd3eeaf/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/gcp/metadata.go#L184)
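
For reference, the blocking pattern has roughly this shape (a simplified sketch with illustrative names and made-up backoff values, not the vendored code):

    // Package credsketch is a hypothetical scratch package for these sketches.
    package credsketch

    import "time"

    // runWithBackoff retries f with exponential backoff and never gives up: if f
    // keeps failing (e.g. the metadata server is unreachable), the loop never
    // returns, which is exactly the hang observed in the controller.
    func runWithBackoff(f func() ([]byte, error)) []byte {
        backoff := 100 * time.Millisecond
        const maxBackoff = 2 * time.Minute
        for {
            value, err := f()
            if err == nil {
                return value
            }
            time.Sleep(backoff)
            backoff *= 2
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
        }
    }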

Proposed solution:

Update: enhance the external deps so they accept a ctx that can be canceled.

Another option is to build without the providers, if that fits the use case, e.g. use `disable_gcp`.
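
For the first option, a rough sketch of what a cancelable variant of the backoff loop could look like (hypothetical signature; the real change has to land in the external repos):

    package credsketch

    import (
        "context"
        "time"
    )

    // runWithBackoffContext is the same retry loop, but it gives up as soon as
    // the caller's context is canceled or its deadline expires, so the revision
    // reconciler can surface an error instead of hanging.
    func runWithBackoffContext(ctx context.Context, f func() ([]byte, error)) ([]byte, error) {
        backoff := 100 * time.Millisecond
        const maxBackoff = 2 * time.Minute
        for {
            value, err := f()
            if err == nil {
                return value, nil
            }
            select {
            case <-ctx.Done():
                return nil, ctx.Err()
            case <-time.After(backoff):
            }
            backoff *= 2
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
        }
    }
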
julz commented 3 years ago

wondering if the ctx in `k8schain.New(ctx, r.client, opt)` implies we might be able to have k8schain respect the timeout? The background resolve already sets up context.WithTimeout, so if k8schain were able to respect that (like go-containerregistry already does as of https://github.com/knative/serving/pull/8724), I think everything should work correctly.
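
Caller-side, that would look roughly like this (a sketch of the shape only; the wrapper and the timeout parameter are made up):

    package credsketch

    import (
        "context"
        "time"

        "github.com/google/go-containerregistry/pkg/authn"
        "github.com/google/go-containerregistry/pkg/authn/k8schain"
        "k8s.io/client-go/kubernetes"
    )

    // newKeychain derives a deadline-bound context before building the keychain.
    // If k8schain (and the credential providers underneath it) honored ctx, this
    // would fail fast instead of hanging when the metadata server is unreachable.
    func newKeychain(ctx context.Context, client kubernetes.Interface, opt k8schain.Options, timeout time.Duration) (authn.Keychain, error) {
        ctx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        return k8schain.New(ctx, client, opt)
    }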

skonto commented 3 years ago

@julz I guess this has to be implemented in that repo, passing a ctx down to the actual HTTP call, which resides in yet another repo. That means some work on those repos; do you think we should do a quick fix with a goroutine instead?

markusthoemmes commented 3 years ago

Does running it in a goroutine buy us anything? If it blocks forever, we'll leak the goroutine, right? AFAIK, there's no way to abort a goroutine.

skonto commented 3 years ago

@markusthoemmes AFAIK it is not blocked on a system call; it is running with backoff forever, check here. If we set a timeout on the goroutine we can wait for it and move on, check the example. To be sure about this we need to detect whether an error is ever returned anyway here.

markusthoemmes commented 3 years ago

@skonto in precisely that example, the goroutine leaks and still runs forever, even if the program moves on. We'll have to thread the context.Context into the infinitely running function and make it return ctx.Err() once ctx.Done() has an event.
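
To illustrate the concern with a generic sketch (not the linked example; names are made up): the select returns after the timeout, but the worker goroutine keeps retrying in the background.

    package credsketch

    import (
        "errors"
        "time"
    )

    // waitWithTimeout runs f in a goroutine and stops waiting after timeout.
    // NOTE: if f never returns, the goroutine is leaked and keeps running (and
    // keeps hitting the metadata endpoint) even though the caller has moved on.
    func waitWithTimeout(f func() ([]byte, error), timeout time.Duration) ([]byte, error) {
        type result struct {
            b   []byte
            err error
        }
        ch := make(chan result, 1) // buffered so a late send does not block the leaked goroutine on top of everything else
        go func() {
            b, err := f()
            ch <- result{b, err}
        }()
        select {
        case r := <-ch:
            return r.b, r.err
        case <-time.After(timeout):
            return nil, errors.New("timed out waiting for credential provider")
        }
    }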

skonto commented 3 years ago

OK, now I see what you mean; then passing a ctx is inevitable. Or maybe we could add a configuration option for the maximum backoff, so the call only waits that long if it is set, and avoid ctx passing completely. The assumption in that code that the metadata server will return something at some point is not true in all cases.

mattmoor commented 3 years ago

Is ctx insufficiently plumbed through ggcr? What do we want to do here?

I'm going to assign to @julz for now.

/assign @julz

cc @jonjohnsonjr for ggcr help

jonjohnsonjr commented 3 years ago

This seems mostly like an issue in credentialprovider/gcp? And we're blocking here, right?

https://github.com/google/go-containerregistry/blob/d1ffc8b82b61910dbcabfb07fdc0b6cb96fb3d66/pkg/authn/k8schain/k8schain.go#L94

We could hack around it in k8schain by doing that in a goroutine with a timeout, but I don't see a way to actually use k8schain's built-in stuff if that blocks forever. We could return an error, or just use the pull secrets, but ideally credential providers would accept a ctx and respect cancellation.
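
As a sketch of the "return an error or fall back" idea (a hypothetical helper, not k8schain's API; whether silently degrading like this is acceptable is exactly the question below):

    package credsketch

    import (
        "context"
        "log"

        "github.com/google/go-containerregistry/pkg/authn"
        "github.com/google/go-containerregistry/pkg/authn/k8schain"
        "k8s.io/client-go/kubernetes"
    )

    // keychainOrDefault tries to build the full k8schain keychain within the ctx
    // deadline. If that hangs (e.g. a credential provider probing an unreachable
    // metadata server), it logs loudly and falls back to the default keychain so
    // digest resolution of public images can still proceed. The goroutine-leak
    // caveat discussed above still applies.
    func keychainOrDefault(ctx context.Context, client kubernetes.Interface, opt k8schain.Options) authn.Keychain {
        type result struct {
            kc  authn.Keychain
            err error
        }
        ch := make(chan result, 1)
        go func() {
            kc, err := k8schain.New(ctx, client, opt)
            ch <- result{kc, err}
        }()
        select {
        case r := <-ch:
            if r.err == nil {
                return r.kc
            }
            log.Printf("k8schain failed (%v); falling back to the default keychain", r.err)
        case <-ctx.Done():
            log.Printf("k8schain timed out (%v); falling back to the default keychain", ctx.Err())
        }
        return authn.DefaultKeychain
    }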

I think the best thing we could do (for now) is fail in k8schain, and suggest using a different release with specific credential providers disabled?

Edit: having read through this, that's exactly what's proposed. My worry with just letting the goroutine expire and moving on is that it will be ~impossible to debug that this happened. If we can somehow surface this state in the status when we encounter an auth error, that would be helpful. This could be on the status of a ksvc/revision or even on the controller's status. Just something to indicate that we are operating in a degraded state.

I'd bias towards failing ASAP, but I understand that might be a terrible UX.

skonto commented 3 years ago

@jonjohnsonjr Yes, that is the point. The best option is to patch the external libs to propagate a ctx that times out and pass it down to the runWithBackoff call. The reason is that credentialprovider.ReadUrl, a few lines below, does the actual failing HTTP request and can easily be enhanced with a ctx. Then we should break the loop when the ctx expires, or we could configure that call to break the loop somehow given some timeout. Regarding using a goroutine: doesn't that suffer from the leak issue discussed above (the goroutine runs forever)? I propose we fix this properly once and for all.
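
Making the HTTP call itself cancelable is straightforward; a sketch of the idea (the real ReadUrl helper may differ in signature and behavior, this only shows the ctx plumbing):

    package credsketch

    import (
        "context"
        "fmt"
        "io"
        "net/http"
    )

    // readURLWithContext is a ctx-aware take on the ReadUrl helper: the request
    // is bound to ctx, so canceling the context aborts the HTTP call instead of
    // letting the surrounding retry loop spin forever.
    func readURLWithContext(ctx context.Context, client *http.Client, url string, header http.Header) ([]byte, error) {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        if header != nil {
            req.Header = header
        }
        resp, err := client.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return nil, fmt.Errorf("metadata request to %s failed: %s", url, resp.Status)
        }
        return io.ReadAll(resp.Body)
    }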

jonjohnsonjr commented 3 years ago

Given https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/541-external-credential-providers we'd diverge pretty significantly from upstream k8s. That seems fine, if @vdemeester is willing to maintain that fork, but it might be less effort to implement k8schain in terms of external credential providers. I'm not sure if that's feasible, as I haven't fully digested the KEP, but it's worth investigating.

Perhaps we wouldn't be able to use the exact same mechanism that's exposed to kube, but I'd guess it would be possible to manually link in the implementations to satisfy k8schain.

Personally, I'd like to see k8schain deprecated and something else spun up (under knative org, perhaps) that serves the same purpose without using the credentialprovider packages.

jonjohnsonjr commented 3 years ago

At least for GCP, this seems very interesting: https://github.com/kubernetes/cloud-provider-gcp/issues/173

Miles-Ahead-Digital commented 3 years ago

@skonto I ran into exactly the same problem.

Maybe you can help me solve the ResolvingDigests problem by helping me adjust my infrastructure.

My cluster is set up according to: https://blog.atomist.com/kubernetes-ingress-nginx-cert-manager-external-dns/

I added Istio, which works great (I added the istio-ingress to the autoDNS). Then I installed Knative Serving (no DNS config so far, because I wanted to test with the curl workaround first).

The helloworld-go example (with the image from the public registry) fails with the ResolvingDigests problem...

Maybe I have to bind ClusterRoles from Knative to the IAM ServiceAccounts? Any help would be appreciated!

Stefan PS: I uploaded the image to the private Docker registry in the GKE project, but the error is still the same.

skonto commented 3 years ago

Hi @Miles-Ahead-Digital, did you follow the steps here? I think the problem you are facing is different, because in my case Serving was launched on AWS and the revision got stuck due to the image resolution library bug.

Miles-Ahead-Digital commented 3 years ago

Hi @skonto, yes, it's something with the (secure) setup described in this link. In a standard GKE setup it works.

It does not work in my setup, although the image in the tutorial (gcr.io/knative-samples/helloworld-go) is public.

Which files or logs would help me best to find the reason? Stefan

Miles-Ahead-Digital commented 3 years ago

@skonto

@ddgenome, who provided the secure cluster configuration, did an analysis:

The GKE Cluster setup in use disables access to most node metadata from within a pod, https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata.

I would guess this check https://github.com/skonto/serving/blob/debug-stuck-r/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/gcp/metadata.go#L218 is failing.

Could you provide a patch, so that I can stay with Google's security recommendations?

Thanks and best regards Stefan

Miles-Ahead-Digital commented 3 years ago

I'm using a GKE cluster with Workload Identity enabled and I'm wondering if this is the reason. Can you provide documentation on how to set up Knative on a cluster using Workload Identity?

vagababov commented 3 years ago

cc @JRBANCEL @ZhiminXiang for GKE help.

Miles-Ahead-Digital commented 3 years ago

I found the https://github.com/google/knative-gcp project and will use their GCP-specific setup of Knative. Therefore you can ignore my problem in this thread; I will open a ticket in the knative-gcp project.

evankanderson commented 3 years ago

/assign @JRBANCEL

/area API

Is there more left to do here?

/triage needs-user-input

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

evankanderson commented 3 years ago

@jonjohnsonjr it looks like you've been thinking about this. Any further comments?

There's also https://github.com/knative/serving/issues/6895 that looks related; not sure if it makes sense to consolidate the two threads.

jonjohnsonjr commented 3 years ago

I don't think it makes sense to consolidate the two threads, but they might be resolved by the same effort.

Since I last checked, there does appear to be some activity in this general area, but I'm not personally invested in this issue, so it's unlikely that I'll try to put together a solution.

Two potential paths forward, from browsing around a bit:

  1. Look into using github.com/kubernetes/legacy-cloud-providers instead of @vdemeester's fork. These appear to have been rewritten slightly, so perhaps they don't suffer in the same way as the original ones that cause this issue.
  2. Look into using the credentialprovider API, e.g. here is the GCP implementation: https://github.com/kubernetes/cloud-provider-gcp/tree/master/cmd/auth-provider-gcp

I have not looked into this that much, but the second option seems way better from a maintenance perspective. Rather than linking everything directly into the binary, you could communicate over a standard interface, which is all I ever wanted.

On the other hand, this doesn't seem like something they expect workloads to use, just the kubelet: https://kubernetes.io/docs/tasks/kubelet-credential-provider/kubelet-credential-provider/

So it might not be possible to even communicate with the auth providers. Here's where the GCE provider configures it: https://github.com/kubernetes/cloud-provider-gcp/blob/8153caac26ead5b5de9a515ba7b7622a0f7ee844/cluster/gce/gci/configure-helper.sh#L3191-L3208

This package actually implements the credentialprovider interface that k8schain operates on, using the external auth provider: https://github.com/kubernetes/kubernetes/blob/ea0764452222146c47ec826977f49d7001b0ea8c/pkg/credentialprovider/plugin/plugin.go

Ideally, we could just import that and call it a day, but I'm not sure if it's possible, and they don't like it when we import their packages: https://github.com/google/go-containerregistry/issues/496

So... /shrug

@andrewsykim seems to be the one working on this stuff on the k8s side, maybe they have some ideas?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.