kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0

Unable to Get ECR Creds - context deadline exceeded #1029

Closed mzameer777 closed 1 month ago

mzameer777 commented 1 month ago

What happened: I'm trying to pull images from a private ECR registry. I have installed and configured the ecr-credential-provider plugin, but I'm getting the following error in the kubelet logs and can't figure out how to proceed:

{"ts":1727300113642.7864,"caller":"plugin/plugin.go:235","msg":"Failed getting credential from external registry credential provider: error execing credential provider plugin ecr-credential-provider for image XXX.dkr.ecr.us-west-2.amazonaws.com/cilium/cilium: context deadline exceeded: I0925 21:34:13.657531    6329 main.go:129] Getting creds for private image XXX.dkr.ecr.us-west-2.amazonaws.com/cilium/cilium\nW0925 21:34:13.657585    6329 main.go:65] No region found in the image reference, the default region will be used. Please refer to AWS SDK documentation for configuration purpose."}

The plugin binary is executed, but the call fails with "context deadline exceeded".

Below is my configuration. I'm using Talos, so this is the credential provider config patch:

machine:
  kubelet:
    credentialProviderConfig:
      apiVersion: kubelet.config.k8s.io/v1
      kind: CredentialProviderConfig
      providers:
        - name: ecr-credential-provider
          matchImages:
            - "*.dkr.ecr.*.amazonaws.com"
          defaultCacheDuration: "12h"
          apiVersion: credentialprovider.kubelet.k8s.io/v1

Environment:

/kind bug

cartermckinnon commented 1 month ago

You're probably getting the timeout here: https://github.com/kubernetes/cloud-provider-aws/blob/d7e05d57709cd46297490b51ce0dd11a54dbea35/cmd/ecr-credential-provider/main.go#L139

Do your nodes have network access to the ECR endpoint?

(The warning you're seeing in that output is misleading but harmless. A fix for it is in #1030.)

mzameer777 commented 1 month ago

I can confirm that the node has network connectivity to the ECR VPC endpoint and that it has full ECR permissions.

What else can I look for? Is there a way to debug this further in my environment?

cartermckinnon commented 1 month ago

Have you verified that aws ecr get-login-password works on the node?

You can try to reproduce the cred provider failure with something like:

echo '{"kind":"CredentialProviderRequest","apiVersion":"credentialprovider.kubelet.k8s.io/v1","image":"'"$IMAGE"'"}' | ecr-credential-provider
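
A slightly more defensive way to run the same reproduction is sketched below. It is only an illustration: the function names `build_request` and `run_plugin` are made up for this example, and it assumes a debug environment with a POSIX-style shell and the GNU coreutils `timeout` command, which may not be available on a stock Talos node. The time limit makes a hang show up as exit code 124 instead of blocking the terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reproducing the credential provider call by hand.

# Print a CredentialProviderRequest for a single image as one line of JSON.
# The image reference is passed as an argument so the shell expands it
# before the JSON is written.
build_request() {
  image="$1"
  printf '{"kind":"CredentialProviderRequest","apiVersion":"credentialprovider.kubelet.k8s.io/v1","image":"%s"}\n' "$image"
}

# Pipe the request into the plugin under a hard time limit.
#   $1: image reference
#   $2: plugin binary (defaults to ecr-credential-provider on PATH)
#   $3: time limit in seconds (defaults to 15; an arbitrary choice)
# GNU timeout exits with status 124 when the limit is hit.
run_plugin() {
  image="$1"
  plugin="${2:-ecr-credential-provider}"
  limit="${3:-15}"
  build_request "$image" | timeout "$limit" "$plugin"
}

# Example (not executed here):
#   run_plugin "XXX.dkr.ecr.us-west-2.amazonaws.com/cilium/cilium"
#   echo "exit status: $?"
```

An exit status of 124 would point at the plugin waiting on something, most likely a network call, rather than at a permissions problem, which would normally fail quickly with an error message.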

mzameer777 commented 1 month ago

I was able to resolve this: my cluster did not have connectivity to the ECR API. Thanks for helping me debug this.
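
For anyone hitting the same timeout: the registry hostname in the image reference (`<account>.dkr.ecr.<region>.amazonaws.com`) and the ECR API hostname (`api.ecr.<region>.amazonaws.com`) are different endpoints, so reaching one does not prove the other is reachable. The sketch below, assuming the standard commercial-region hostname pattern (FIPS and `amazonaws.com.cn` hosts are not handled) and that `curl` and `sed` are available, derives the API hostname from an image reference and checks HTTPS reachability. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking both ECR endpoints from a node or debug pod.

# Derive the regional ECR API hostname from an image reference of the form
# <account>.dkr.ecr.<region>.amazonaws.com/<repository>.
# Returns non-zero if the reference does not match that pattern.
ecr_api_host() {
  registry="${1%%/*}"   # keep only the hostname part of the reference
  region="$(printf '%s\n' "$registry" |
    sed -n 's/^[^.]*\.dkr\.ecr\.\([^.]*\)\.amazonaws\.com$/\1/p')"
  [ -n "$region" ] || return 1
  printf 'api.ecr.%s.amazonaws.com\n' "$region"
}

# Succeed if an HTTPS request to the host gets any response at all.
# Without -f, curl exits 0 even on 4xx/5xx, which is what we want here:
# the question is whether the endpoint is reachable, not whether we are authorized.
check_https() {
  curl -sS -o /dev/null --connect-timeout 5 --max-time 10 "https://$1/"
}

# Example (not executed here):
#   image="XXX.dkr.ecr.us-west-2.amazonaws.com/cilium/cilium"
#   check_https "${image%%/*}"            && echo "registry endpoint reachable"
#   check_https "$(ecr_api_host "$image")" && echo "API endpoint reachable"
```

In a VPC without internet egress, a failure of the second check would be consistent with a missing interface endpoint for the ECR API alongside the one for the registry.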