hashicorp / go-discover

Discover nodes in cloud environments
Mozilla Public License 2.0

AWS cloud autojoin not working on ECS task with task networking #61

Open nahuaque opened 6 years ago

nahuaque commented 6 years ago

Consul agent AWS cloud autojoin is working fine on the ECS container instance, but doesn't work when I start the agent in a task with task networking. Presumably this is because the vendored version of aws-sdk-go isn't recent enough to support obtaining credentials via Task Metadata Endpoint version 3, which was only introduced about 14 days ago.

pearkes commented 5 years ago

That sounds like a reasonable assumption. Perhaps if we get in #54 we can pair an SDK update with it.

nahuaque commented 5 years ago

I know you guys are busy and all, but AFAICT #54 is still stalled, and I think an SDK update is enough to get this working.

dekimsey commented 3 years ago

@pearkes, is there a contact we could ping to get this revisited? It's been a couple years since the initial report and I know we are interested in seeing the library have better native support for ECS.

Thank you

For instance, I was recreating our retry_join logic for Consul today and ran into an issue where the region must be specified or the ECS-based tasks wouldn't discover the region correctly. I'm guessing this has something to do with the outdated SDK, or simply insufficient testing in ECS itself.

2020-11-09T21:12:19.098Z [ERROR] agent: Cannot discover address: cluster=LAN address="provider=aws tag_key=consul-servers tag_value=amazing-courser" error="discover-aws: GetInstanceIdentityDocument failed: EC2MetadataRequestError: failed to get EC2 instance identity document
2020-11-09 15:12:19 caused by: RequestError: send request failed
2020-11-09 15:12:19 caused by: Get "http://169.254.169.254/latest/dynamic/instance-identity/document": dial tcp 169.254.169.254:80: connect: invalid argument"
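
The failing request above is a lookup against the EC2 instance metadata endpoint, and the workaround described in this comment is to specify the region explicitly. A minimal sketch of that workaround follows, assuming the `region` key in the provider string is what avoids the lookup and that the task's ARN is available (for example from the ECS task metadata). The helper names and the sample ARN are hypothetical and not part of go-discover.

```python
def region_from_arn(arn: str) -> str:
    """Return the region field of an ARN.

    ARNs have the shape arn:partition:service:region:account-id:resource.
    """
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn" or not parts[3]:
        raise ValueError(f"not a regional ARN: {arn!r}")
    return parts[3]


def retry_join(region: str, tag_key: str, tag_value: str) -> str:
    """Build a cloud auto-join string that carries an explicit region."""
    return f"provider=aws region={region} tag_key={tag_key} tag_value={tag_value}"


# Hypothetical task ARN, used only for illustration.
task_arn = "arn:aws:ecs:us-west-2:111122223333:task/default/0123456789abcdef"
print(retry_join(region_from_arn(task_arn), "consul-servers", "amazing-courser"))
```

The resulting string would be passed to `retry_join` (or `-retry-join`) in place of the region-less one shown in the log line.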

dekimsey commented 3 years ago

Those following this issue may want to read the recently announced Consul Service Mesh for Amazon ECS. Interestingly, the discover-servers component of this new architecture does not use go-discover. In my opinion, this may be related to Extensible discovery for Cloud Auto-Join.

Anyway, those looking to run Consul (perhaps other products later) in ECS have hope. Looks like it's coming in one form or another!

ericbrumfield commented 3 years ago

@dekimsey thanks for that link. I'm crossing my fingers here too, as a client I work for asked me just today to try to move a 3-server-node cluster to Fargate, and I didn't believe it could be done yet. I wonder whether, in that article, HashiCorp is hinting that future support for deploying "a production-ready Consul server" would cover multi-server-node setups in this scenario. Would HashiCorp steer this towards the recommended 3- or 5-node Consul server setups in ECS, or would it be limited to a single Consul server node when running in an ECS or Fargate hosting context?

iandelahorne commented 2 years ago

I'm seeing this too on the 1.10.2 container image on Fargate. Our Consul servers are on EC2, and discovery works great for clients on EC2 using -retry-join "provider=aws tag_key=role tag_value=consul-server"

However, if we try to deploy a client sidecar on Fargate, it does not work. Here's a snippet of the task definition:

  "containerDefinitions": [
    {
      "name": "consul",
      "image": "public.ecr.aws/hashicorp/consul:1.10.2",
      "essential": true,
      "entryPoint": ["/bin/sh", "-ec"],
      "command": [
        "ECS_IPV4=$(curl -s $ECS_CONTAINER_METADATA_URI_V4 | jq -r '.Networks[0].IPv4Addresses[0]')\n exec consul agent -advertise \"$ECS_IPV4\" -datacenter development -retry-join \"provider=aws tag_key=role tag_value=consul-server\" -data-dir /consul/data"
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/consultest",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "consul"
        }
      },
      "portMappings": [
        {
          "containerPort": 8300,
          "hostPort": 8300,
          "protocol": "tcp"
        },
        {
          "containerPort": 8300,
          "hostPort": 8300,
          "protocol": "udp"
        }
      ]
    }
  ],
  "placementConstraints": [],
  "requiresCompatibilities": [
    "FARGATE"
  ],
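
The shell command in the task definition above reads the task's IPv4 address from the container metadata endpoint before starting the agent. The same lookup can be sketched in Python, assuming the response has the `Networks[0].IPv4Addresses[0]` shape that the jq filter relies on. The function names and the sample payload are hypothetical.

```python
import json
import os
import urllib.request


def first_ipv4(metadata: dict) -> str:
    """Return the first IPv4 address of the first network, like the jq filter."""
    try:
        return metadata["Networks"][0]["IPv4Addresses"][0]
    except (KeyError, IndexError, TypeError) as exc:
        raise LookupError("no IPv4 address in container metadata") from exc


def fetch_container_metadata(timeout: float = 2.0) -> dict:
    """Fetch the container metadata document; only works inside an ECS task."""
    url = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Illustrative payload (not real output), so the sketch also runs outside ECS.
sample = {"Networks": [{"NetworkMode": "awsvpc", "IPv4Addresses": ["10.0.2.106"]}]}
print(first_ipv4(sample))
```

Inside a task, `first_ipv4(fetch_container_metadata())` would yield the value passed to `-advertise`.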

In the logs, we see:

[ERROR] agent: Cannot discover address: cluster=LAN address="provider=aws tag_key=role tag_value=development" error="discover-aws: GetInstanceIdentityDocument failed: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/dynamic/instance-identity/document": dial tcp 169.254.169.254:80: connect: invalid argument"

According to the ECS task IAM role docs, inside ECS the credentials for the task IAM role should be fetched from http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
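
As a rough illustration of how that URL is assembled, here is a minimal sketch that only builds the address and performs no request. The function name and the sample relative URI are hypothetical; the host and the environment variable name are the ones quoted above.

```python
import os

# Link-local host quoted above for ECS task credentials.
ECS_CREDENTIALS_HOST = "http://169.254.170.2"


def container_credentials_url(environ=os.environ):
    """Return the task credentials URL, or None when not running with a task role."""
    relative_uri = environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if not relative_uri:
        return None
    return ECS_CREDENTIALS_HOST + relative_uri


# Hypothetical value, for illustration only.
print(container_credentials_url({"AWS_CONTAINER_CREDENTIALS_RELATIVE_URI": "/v2/credentials/example-id"}))
```

A `None` result indicates the variable is unset, which is the case on a plain EC2 instance, where the instance metadata endpoint applies instead.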