Improve support for ECS workloads

azdagron commented 2 years ago

This issue tracks a discussion on how to provide improved support for workloads running in ECS or similar environments. Due to the auto-scaling nature of these types of environments, current node attestation methods fall short in providing unique identities to the agents running in these workloads.

A previous discussion has been had regarding an IAM node attestor and the maybe a credential exchange API in issue #2780.

PR (#3231) proposed a change to randomize the ID provided by x509pop so that all agents in a task could present the same keypair but otherwise have unique identities.

Ideally we adapt an existing or develop a new node attestation method that allows us to uniquely and strongly identify each agent.

hellerda commented 1 year ago

Based on recent testing and discussions on Slack, I think these are the available options for deploying Agents in ECS. The scenario is an Agent running as a sidecar to the ECS workload, i.e. running in another container in the same ECS Task.

Node-level attestation:

x509pop - Simple, already-available solution. Requires some support on top to distribute x509 certs before agent can attest. But some users have reported doing this, and using this attestor for ECS.
"aws_iam" attestor (proposed) - Well suited to ECS environment; requires no prior configuration to allow agents to attest: only that agent is able to assume an IAM Role. Has some challenges in providing further "node resolution" to safely determine agent attributes beyond just the IAM role. Even the AWS "role session name" (current session of IAM role) is challenging to safely verify. But in a use case where the Workload's IAM role suffices for identification, and the IAM role is sufficiently secured (that is: you can trust that only authorized workloads are able to assume this role), this does present a workable solution today. (see #2780)

Workload attestor:

ECS-specific attestor (proposed) - This was mentioned by a user on Slack, who claimed to have a working prototype. I don't have details but presumably works in a manner similar to the current "k8s" attestor for Kubernetes: relying on some workload-level identity provisioned by ECS (IAM role?) and having the Agent be able to verify this through calls to ECS API.
UNIX Workload attestor - The current UNIX attestor requires the workload (accessing the Agent UDS) to be in the same process namespace, which is not possible on ECS Fargate (see: https://github.com/aws/containers-roadmap/issues/1449 and https://docs.aws.amazon.com/app-mesh/latest/userguide/tls.html#mtls-configure-ecs)

ECS supports "pidMode" but only for ECS EC2, not ECS Fargate (see: https://aws.amazon.com/about-aws/whats-new/2018/10/amazon-ecs-now-allows-two-additional-docker-flags-/ and https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskDefinition.html#ECS-Type-TaskDefinition-pidMode)

I did not test on ECS EC2 but I assume it would work there with "pidMode". I did test on Fargate, with an agent and workload in different containers in the same ECS task, with a shared ECS bind mount between them and the UDS hosted over this mount. I found that the workload could access the UDS OK but could not attest due to the above issue.

Expanding on the UNIX Workload attestor: I see these possibilities:

AWS ECS to provide pidMode support on Fargate. While it would likely never be supported between ECS Tasks, it could be supported between containers in the same Task. If so it would solve the issue.
UNIX Workload attestor to support workloads in a different namespace. This would require SCM_CREDENTIALS support in the Linux kernel for a process calling from a different namespace (i.e. translate_pid() syscall). I'm not clear if the support is in Linux kernel to do this and the SPIRE attestor does not implement it, or if it's not doable at all with this workload attestation method. But if it could be done it would solve the issue.
Use a modified UNIX attestor with workload attestion disabled. I did some testing with an agent and workload in different containers in the same ECS task (in Fargate), and built a modifed UNIX attestor that always returns "PID=1" from the call to syscall.GetsockoptUcred(). This fools the attestor into thinking the calling process is local PID 1, which effectively "disables" the workload attestation while still allowing access to the UDS. With this mod, the workload in the other container was able fetch an SVID.

This is of course something you would never do in the general-purpose UNIX attestor. But in the case of two containers running in the same ECS task, it is assumed that the only workload authorized to access workload API is running in the same task. That is: a one-to-one correspondence between agent and workload. So it seems safe, except for the exposure of the bind mount on the ECS node itself. It's an ephemeral mount, but it's a mount on the server somewhere.

Access the UDS over a TCP connection - This approach was suggested by a member of my team and seems viable. The Workload API SDK apparently supports this already. Taking some standard test workloads that call workload API, and modifying the workloadapi.WithClientOptions() call (golang SDK) to pass a TCP socket address, the client was able to do this. As tThe SPIRE agent does not host a TCP port directly, you must front the UDS with a TCP proxy such as an SSH tunnel, or "socat". With this method, the workload in the other container was able fetch an SVID. It required no mod to the UNIX attestor and no shared bind mount. As such, it seems preferable to 3. Although like 3, the attestor is not really verifying the workload process. But in this environment that seems OK.

Also: this is rather stating the obvious but if the Agent is running in the same container as the workload, there is no issue. All the above is if you want it running in a sidecar.

Finally, will mention that another, orthogonal solution is SVID Store to cloud provider secrets store #1843. In this case there is no concern of Agent proximity to workload, so none of the above limitations apply.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 365 days with no activity.

github-actions[bot] commented 10 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.

evan2645 commented 10 months ago

@LukeMwila volunteered to work on this issue, so I'll go ahead and re-open it. Thanks Luke!!

LukeMwila commented 10 months ago

@evan2645 happy to work on this one!

spiffe / spire

Improve support for ECS workloads #3261