Wondering if the `ctx` in `k8schain.New(ctx, r.client, opt)` implies we might be able to have k8schain respect the timeout? The background resolve already sets up `context.WithTimeout`, so if k8schain were able to respect that (like go-containerregistry already does as of https://github.com/knative/serving/pull/8724), I think everything should work correctly.
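For concreteness, a minimal sketch of what that would buy us, assuming the `k8schain.New` signature quoted above and that the library honored the context all the way down; the `newKeychain` wrapper is hypothetical:

```go
import (
	"context"
	"time"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
)

// newKeychain is a hypothetical wrapper: if k8schain.New respected ctx down
// to the credential providers, the deadline here would bound the
// otherwise-unbounded metadata lookups.
func newKeychain(client kubernetes.Interface, opt k8schain.Options) (authn.Keychain, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return k8schain.New(ctx, client, opt)
}
```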
@julz I guess this has to be implemented in that repo, with a ctx passed down to the actual HTTP call, which lives in yet another repo. That means some work on those repos; do you think we should do a quick fix with a goroutine instead?
Does running it in a goroutine buy us anything? If it blocks forever, we'll leak the goroutine, right? AFAIK, there's no way to abort a goroutine.
@skonto in precisely that example, the goroutine leaks and still runs forever, even if the program has moved on. We'll have to thread the `context.Context` into the infinitely running function and make it return `ctx.Err()` once `ctx.Done()` has an event.
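A minimal sketch of that pattern; `fetchOnce` and `pollMetadata` are illustrative stand-ins, not names from the actual code:

```go
import (
	"context"
	"errors"
	"time"
)

// fetchOnce stands in for a single metadata-server attempt (hypothetical).
func fetchOnce(ctx context.Context) ([]byte, error) {
	return nil, errors.New("metadata server unavailable")
}

// pollMetadata retries forever *unless* the caller's ctx expires; returning
// ctx.Err() once ctx.Done() fires is the change described above.
func pollMetadata(ctx context.Context) ([]byte, error) {
	for {
		if data, err := fetchOnce(ctx); err == nil {
			return data, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // caller timed out or cancelled: stop retrying
		case <-time.After(time.Second): // simple fixed backoff for the sketch
		}
	}
}
```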
OK, now I see what you mean; passing ctx is inevitable then. Alternatively, we could add a configuration option for the maximum backoff, so the code waits at most that long when it is set, and avoid the ctx plumbing entirely. The assumption in that code that the metadata server will eventually return something does not hold in all cases.
Is `ctx` insufficiently plumbed through ggcr? What do we want to do here?
I'm going to assign to @julz for now.
/assign @julz
cc @jonjohnsonjr for ggcr help
This seems mostly like an issue in `credentialprovider/gcp`? And we're blocking here, right?
We could hack around it in `k8schain` by doing that in a goroutine with a timeout, but I don't see a way to actually use k8schain's built-in stuff if that blocks forever. We could return an error, or just use the pull secrets, but ideally credential providers would accept a ctx and respect cancellation.
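For reference, the hack would look roughly like the sketch below; note the comment about the goroutine still leaking, which is the objection raised earlier in the thread:

```go
import (
	"context"

	"github.com/google/go-containerregistry/pkg/authn"
)

// callWithTimeout wraps a blocking, non-cancellable call. The select gives
// up once ctx expires, but if the call never returns, the goroutine (and
// whatever it is blocked on) leaks -- this papers over the problem rather
// than fixing it.
func callWithTimeout(ctx context.Context, blocking func() (authn.Keychain, error)) (authn.Keychain, error) {
	type result struct {
		kc  authn.Keychain
		err error
	}
	ch := make(chan result, 1) // buffered: the goroutine can finish even if nobody reads
	go func() {
		kc, err := blocking()
		ch <- result{kc, err}
	}()
	select {
	case r := <-ch:
		return r.kc, r.err
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```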
I think the best thing we could do (for now) is fail in k8schain and suggest using a different release with the problematic credential providers disabled?
Edit: having read through this, that's exactly what's proposed. My worry with just letting the goroutine time out and moving on is that it will be ~impossible to debug that this happened. If we can surface this state somehow in the status when we encounter an auth error, that would be helpful. This could be on the status of a ksvc/revision, or even on the controller's status; just something to indicate that we are operating in a degraded state.
I'd bias towards failing ASAP, but I understand that might be a terrible UX.
@jonjohnsonjr Yes, that is the point. The best option is to patch the external libs to propagate a ctx that times out, and pass it down to the `runWithBackoff` call. The reason is that `credentialprovider.ReadUrl`, a few lines below, makes the actual failing HTTP request and can easily be enhanced with a ctx. Then we should break the loop when the ctx expires, or we could configure that call to break the loop somehow given some timeout. Regarding using a goroutine: doesn't that suffer from the leak issue discussed above (the goroutine runs forever)? I propose we fix this properly, once and for all.
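Roughly, the upstream change could look like this; a sketch against hypothetical signatures, not the actual credentialprovider code:

```go
import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// readURL is a ctx-aware take on the helper that performs the metadata
// request; http.NewRequestWithContext makes the HTTP call abortable.
func readURL(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Metadata-Flavor", "Google") // GCE metadata-server convention
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("metadata server returned %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// runWithBackoff retries f with capped exponential backoff, but gives up as
// soon as ctx expires instead of looping forever.
func runWithBackoff(ctx context.Context, f func(context.Context) ([]byte, error)) ([]byte, error) {
	backoff := time.Second
	for {
		data, err := f(ctx)
		if err == nil {
			return data, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // the caller's timeout breaks the loop
		case <-time.After(backoff):
		}
		if backoff < 32*time.Second {
			backoff *= 2
		}
	}
}
```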
Given https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/541-external-credential-providers we'd diverge pretty significantly from upstream k8s. That seems fine, if @vdemeester is willing to maintain that fork, but it might be less effort to implement k8schain in terms of external credential providers. I'm not sure if that's feasible, as I haven't fully digested the KEP, but it's worth investigating.
Perhaps we wouldn't be able to use the exact same mechanism that's exposed to kube, but I'd guess it would be possible to manually link in the implementations to satisfy k8schain.
Personally, I'd like to see k8schain deprecated and something else spun up (under knative org, perhaps) that serves the same purpose without using the credentialprovider packages.
At least for GCP, this seems very interesting: https://github.com/kubernetes/cloud-provider-gcp/issues/173
@skonto I ran into exactly the same problem.
Maybe you can help me solve the ResolvingDigest problem by helping me adjust my infrastructure.
My cluster is set up following: https://blog.atomist.com/kubernetes-ingress-nginx-cert-manager-external-dns/
I added Istio, which works great (I added the istio-ingress to the autoDNS). Then I installed Knative Serving (no DNS config so far, because I wanted to test with the curl workaround first).
The helloworld-go example (with the image from the public registry) fails with the ResolvingDigest problem...
Maybe I have to bind ClusterRoles from Knative to the IAM service accounts? Any help would be appreciated!
Stefan
PS: I uploaded the image to the private Docker registry in the GKE project, but the error is still the same.
Hi @Miles-Ahead-Digital did you follow the steps here? I think the problem you are facing is different, because in my case Serving was running on AWS and the revision got stuck due to the image-resolution library bug.
Hi @skonto yes, it's something with the (secure) setup described in this link; in a standard GKE setup it works.
It does not work in my setup, although the image in the tutorial is public: gcr.io/knative-samples/helloworld-go
Which files or logs would best help me find the reason?
Stefan
@skonto
@ddgenome, who provided the secure cluster configuration, did an analysis:
The GKE cluster setup in use disables access to most node metadata from within a pod: https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata.
I would guess this check https://github.com/skonto/serving/blob/debug-stuck-r/vendor/github.com/vdemeester/k8s-pkg-credentialprovider/gcp/metadata.go#L218 is failing.
Could you provide a patch so that I can stay with Google's security recommendations?
Thanks and best regards, Stefan
I'm using a GKE cluster with Workload Identity enabled, and I'm wondering if this is the reason. Can you provide documentation on how to set up Knative on a cluster using Workload Identity?
cc @JRBANCEL @ZhiminXiang for GKE help.
I found the https://github.com/google/knative-gcp project and will use their GCP-specific setup of Knative. You can therefore ignore my problem in this thread; I will open a ticket in the knative-gcp project.
/assign @JRBANCEL
/area API
Is there more left to do here?
/triage needs-user-input
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
@jonjohnsonjr it looks like you've been thinking about this. Any further comments?
There's also https://github.com/knative/serving/issues/6895 that looks related; not sure if it makes sense to consolidate the two threads.
I don't think it makes sense to consolidate the two threads, but they might be resolved by the same effort.
Since I last checked, there does appear to be some activity in this general area, but I'm not personally invested in this issue, so it's unlikely that I'll try to put together a solution.
Two potential paths forward, from browsing around a bit:
I have not looked into this that much, but the second option seems way better from a maintenance perspective. Rather than linking everything directly into the binary, you could communicate over a standard interface, which is all I ever wanted.
On the other hand, this doesn't seem like something they expect workloads to use, just the kubelet: https://kubernetes.io/docs/tasks/kubelet-credential-provider/kubelet-credential-provider/
So it might not be possible to even communicate with the auth providers. Here's where the GCE provider configures it: https://github.com/kubernetes/cloud-provider-gcp/blob/8153caac26ead5b5de9a515ba7b7622a0f7ee844/cluster/gce/gci/configure-helper.sh#L3191-L3208
This package actually implements the credentialprovider interface that `k8schain` operates on, using the external auth provider: https://github.com/kubernetes/kubernetes/blob/ea0764452222146c47ec826977f49d7001b0ea8c/pkg/credentialprovider/plugin/plugin.go
Ideally, we could just import that and call it a day, but I'm not sure if it's possible, and they don't like it when we import their packages: https://github.com/google/go-containerregistry/issues/496
So... /shrug
@andrewsykim seems to be the one working on this stuff on the k8s side, maybe they have some ideas?
What version of Knative?
Expected Behavior
Install a basic service and Revision should be ready.
Actual Behavior
Revision gets stuck:
Steps to Reproduce the Problem
```go
// DockerConfigProvider is the interface that registered extensions implement
// to materialize 'dockercfg' credentials.
type DockerConfigProvider interface {
	// Enabled returns true if the config provider is enabled.
	// Implementations can be blocking - e.g. metadata server unavailable.
	Enabled() bool
	// Provide returns docker configuration.
	// Implementations can be blocking - e.g. metadata server unavailable.
	// The image is passed in as context in the event that the
	// implementation depends on information in the image name to return
	// credentials; implementations are safe to ignore the image.
	Provide(image string) DockerConfig
}
```
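For contrast, a hypothetical context-aware variant of this interface (not the upstream API, just an illustration) would let callers bound both potentially blocking methods, which is essentially what this thread asks for:

```go
import "context"

// ContextDockerConfigProvider is hypothetical: the same contract as above,
// but both potentially blocking methods accept a ctx so callers can impose
// the timeout that the current interface cannot honor.
type ContextDockerConfigProvider interface {
	// Enabled reports whether the provider is usable; it may probe the
	// metadata server, so it honors ctx.
	Enabled(ctx context.Context) bool
	// Provide returns docker configuration for image, respecting ctx's
	// deadline and cancellation.
	Provide(ctx context.Context, image string) (DockerConfig, error)
}
```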