libdns / route53

AWS Route53 provider implementation for libdns
MIT License

Sometimes does not fall back to IMDSv1 when using EC2 Instance Profiles #58

Open seansaleh opened 1 year ago

seansaleh commented 1 year ago

Spoiler: I fixed this by setting the EC2 instance metadata option HttpPutResponseHopLimit to 2. This allows the AWS SDK to use IMDSv2 from inside a Docker container on an EC2 machine.

Fixes:

Details: I was running into failures when using Route53 via caddy-dns/route53 inside a Docker container on an EC2 node, with an IAM role attached to the EC2 machine. Note: I was using the Let's Encrypt staging environment, but I don't think that's relevant.

IMDSv1 and IMDSv2 were both enabled for the EC2 machine.

I'm still not sure why I ran into this issue only now; a machine on a different AWS account ran the same code without issues. However, I had seen similar problems before: tools that had updated to newer versions of the AWS SDK for Go failed until either IMDSv2 was disabled or an additional hop was allowed, so I had an idea of what to try here.
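To check whether the hop limit is the problem, a small probe run from inside the container can help. The following is only a sketch and is not taken from libdns, Caddy, or the AWS SDK: it sends the documented IMDSv2 token request (an HTTP PUT to /latest/api/token on 169.254.169.254) with a short timeout. The function names build_token_request and imdsv2_reachable are made up for illustration.

```python
import urllib.error
import urllib.request

# Link-local address of the EC2 instance metadata service (IMDS).
IMDS_BASE_URL = "http://169.254.169.254"
# Header that IMDSv2 requires on the token request.
TOKEN_TTL_HEADER = "X-aws-ec2-metadata-token-ttl-seconds"


def build_token_request(base_url=IMDS_BASE_URL, ttl_seconds=21600):
    """Build the IMDSv2 session-token request, which must be an HTTP PUT."""
    return urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={TOKEN_TTL_HEADER: str(ttl_seconds)},
    )


def imdsv2_reachable(base_url=IMDS_BASE_URL, timeout=2.0):
    """Return True if a non-empty token comes back within `timeout` seconds."""
    try:
        request = build_token_request(base_url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200 and bool(response.read())
    except (urllib.error.URLError, OSError):
        # A timeout here is the usual symptom when the PUT response is
        # dropped because the hop limit is too low for a containerized caller.
        return False


if __name__ == "__main__":
    print("IMDSv2 reachable:", imdsv2_reachable())
```

If this reports False inside the container but True on the EC2 host itself, the extra network hop added by Docker's bridge networking is a likely culprit, and raising the hop limit is worth trying.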

Versions: Caddy 2.6.4, caddy-dns/route53 v1.3.2, libdns/route53 v1.3.2

The failing logs: The two relevant messages I see are "could not determine zone for domain" and "unexpected response code 'SERVFAIL'".

{"level":"error","ts":1684773178.4666998,"logger":"http.acme_client","msg":"cleaning up solver","identifier":"example.com","challenge_type":"dns-01","error":"no memory of presenting a DNS record for \"_acme-challenge.example.com\" (usually OK if presenting also failed)"}
{"level":"error","ts":1684773178.5130796,"logger":"tls.obtain","msg":"could not get certificate from issuer","identifier":"example.com","issuer":"acme-staging-v02.api.letsencrypt.org-directory","error":"[example.com] solving challenges: presenting for challenge: could not determine zone for domain \"_acme-challenge.example.com\": unexpected response code 'SERVFAIL' for _acme-challenge.example.com. (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/REDACTED/REDACTED) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)"}
{"level":"error","ts":1684773178.5131493,"logger":"tls.obtain","msg":"will retry","error":"[example.com] Obtain: [example.com] solving challenges: presenting for challenge: could not determine zone for domain \"_acme-challenge.example.com\": unexpected response code 'SERVFAIL' for _acme-challenge.example.com. (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/REDACTED/REDACTED) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)","attempt":26,"retrying_in":21600,"elapsed":324020.948514926,"max_duration":2592000}

Full fix with Terraform: Since I was using Terraform, my full fix was to add the following block to my aws_instance resource:

  metadata_options {
    http_put_response_hop_limit = 2
    http_endpoint = "enabled"
  }
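
For anyone not using Terraform, the same two settings can presumably be applied to an existing instance through the EC2 ModifyInstanceMetadataOptions API. The sketch below assumes boto3 is installed and credentials are available from somewhere other than the affected container; the helper names metadata_options_kwargs and apply_hop_limit are hypothetical, and the instance ID shown in the usage note is a placeholder.

```python
def metadata_options_kwargs(instance_id, hop_limit=2):
    """Build the arguments for EC2 ModifyInstanceMetadataOptions."""
    # EC2 accepts hop limits from 1 to 64; 2 leaves room for one container hop.
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return {
        "InstanceId": instance_id,
        "HttpEndpoint": "enabled",
        "HttpPutResponseHopLimit": hop_limit,
    }


def apply_hop_limit(instance_id, hop_limit=2):
    """Send the change to EC2 (needs boto3 and the matching IAM permission)."""
    # Third-party dependency, imported lazily so the helper above stays
    # usable without it.
    import boto3

    ec2 = boto3.client("ec2")
    return ec2.modify_instance_metadata_options(
        **metadata_options_kwargs(instance_id, hop_limit)
    )
```

Calling apply_hop_limit("i-0123456789abcdef0") with a real instance ID should mirror the metadata_options block above. If the instance is managed by Terraform, changing it out of band will show up as drift, so the Terraform block is still the better long-term fix.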

theryecatcher commented 1 year ago

I stumbled on this issue while trying DNS validation from Caddy running in a Docker container with the route53 module as the DNS provider; the validations were failing. In my case the error was:

failed to sign request: failed to retrieve credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded

Versions: Caddy 2.7.2, caddy-dns/route53 v1.3.3, libdns/route53 v1.3.3

This did take my whole server down. Thanks for the workaround.