kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0
636 stars 561 forks source link

Set httpPutResponseHopLimit to 2 in instanceMetadataOptions as default when creating instance #4247

Closed wyike closed 1 year ago

wyike commented 1 year ago

/kind feature

Describe the solution you'd like [A clear and concise description of what you want to happen.]

Regarding https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4037 support, I would propose to set the default HTTPPutResponseHopLimit to 2 in container environment.

When customers is using instance profile role instead of using base64 aws credentials (very typical usage in production env), capa container needs 2 hops to retrieve aws credentials from metadata service. If default hop limit is 1, capa fails to get credentials and fail at the first with:

E0430 03:18:00.552022       1 controller.go:329] "Reconciler error" err=<
    .spec.vpc.id is set but VPC resource is missing in AWS; failed to describe VPC resources. (might be in creation process): failed to query ec2 for VPCs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors
 > controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="tkg-system/tkg-mgmt-aws-pvbgc" namespace="tkg-system" name="tkg-mgmt-aws-pvbgc" reconcileID=8e5f1348-7f16-4441-9d18-a4b141eb973b

If we set HTTPPutResponseHopLimit to 2 as default, it will avoid capa failure and other applications that needs to access AWS. Otherwise we need customers to set the awsmachine template explicitly:

  template:
    spec:
      instanceMetadataOptions:
        httpPutResponseHopLimit: 2

They are very likely to forget or not aware of this knowledge and get a failed env.

Another benefit is customers don't need to change awsmachinetemplate very often due to the HopLimit issue in production env , which as we known, is immutable and it is a burden to update to a new machinetemplate.

I also see HTTPPutResponseHopLimit to 2 is recommended in container environment: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html#imds-considerations

To avoid problems with instance metadata retrieval, consider the following: The AWS SDKs use IMDSv2 calls by default. If the IMDSv2 call receives no response, the SDK retries the call and, if still unsuccessful, uses IMDSv1. This can result in a delay. In a container environment, if the hop limit is 1, the IMDSv2 response does not return because going to the container is considered an additional network hop. To avoid the process of falling back to IMDSv1 and the resultant delay, in a container environment we recommend that you set the hop limit to 2

https://aws.amazon.com/about-aws/whats-new/2020/08/amazon-eks-supports-ec2-instance-metadata-service-v2/

Now, newly launched and any updated EKS managed node groups will be configured with a metadata token response hop limit set to 2. For self-managed nodes, CloudFormation templates and eksctl have been updated to launch nodes by default with a hop limit of 2.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
wyike commented 1 year ago

/assign

Skarlso commented 1 year ago

This wasn't fixed yet, right?

wyike commented 1 year ago

This wasn't fixed yet, right?

I think it's more like a feature enhancement :)

Skarlso commented 1 year ago

Yes, what I meant to say is that if it is done or not... :D

wyike commented 1 year ago

ah, SORRY...my eyes 😵... I'll submit the commit.