spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

spire-server: [AWS Node Attestor] Feature Request / Enhancement for node attestation #4770

Closed: rushi47 closed this issue 6 months ago

rushi47 commented 9 months ago

Hello Team 👋 Hope you are well.

We are looking for a feature wherein we add one more check to the node attestation flow. When this feature is enabled, a node can be attested only if the account ID of the node making the request belongs to the configured AWS Organization. If it does, the existing flow continues; otherwise the attestation request is rejected.

If I understand correctly, we currently have something like this implemented in the code (1, 2), but it's not documented anywhere. It looks like the feature is designed so that a static list of allowed accounts can be supplied in the bootstrap configuration, and when a node makes an attestation request, its account ID is validated against that static list. Please let me know if there are any gaps in my understanding. This is exactly what we want, but manually managing this static account list is difficult.
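For illustration only, here is a minimal sketch of the check described above, assuming the account ID has already been extracted from the node's attestation request; the names are hypothetical and not SPIRE's actual code:

    // accountAllowed reports whether the attesting node's AWS account ID is in
    // the statically configured allow-list (illustrative sketch only).
    func accountAllowed(accountID string, allowedAccounts map[string]struct{}) bool {
        _, ok := allowedAccounts[accountID]
        return ok
    }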

Potential solution: one option would be to periodically retrieve the list of account IDs from the AWS Organization, store them locally, and then verify that the node's account ID is part of the stored list.

To get the list of accounts from the AWS Organization, we will need to call the ListAccounts API. From the documentation, it looks like this API can only be called from the management account or from a delegated-administrator member account.

This operation can be called only from the organization's management account or by a member account that is a delegated administrator for an AWS service.

For that reason, we can create a role in the management account with the required permission, and a trust relationship so that the accounts running the SPIRE regional servers can assume it. A sample configuration could look something like this:

      account_ids_belong_to_org_validation = {
        org_account_id     = "7891011"
        org_account_role   = "spire-server-org-role"
        org_account_region = "us-west-2"
        org_account_map_ttl = "6"
      }
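As a rough sketch (not SPIRE's actual plugin code) of what that lookup could look like with the AWS SDK for Go v2, assuming the role, account, and region values from the proposed config above:

    import (
        "context"
        "fmt"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/credentials/stscreds"
        "github.com/aws/aws-sdk-go-v2/service/organizations"
        "github.com/aws/aws-sdk-go-v2/service/sts"
    )

    // listOrgAccountIDs fetches the set of member account IDs in the AWS
    // Organization by assuming a role in the management account.
    func listOrgAccountIDs(ctx context.Context, orgAccountID, orgAccountRole, orgAccountRegion string) (map[string]struct{}, error) {
        cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(orgAccountRegion))
        if err != nil {
            return nil, err
        }

        // Assume the role that lives in the organization's management account
        // and trusts the accounts running the SPIRE servers.
        roleARN := fmt.Sprintf("arn:aws:iam::%s:role/%s", orgAccountID, orgAccountRole)
        cfg.Credentials = aws.NewCredentialsCache(stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), roleARN))

        // Walk every page of ListAccounts and collect the member account IDs.
        client := organizations.NewFromConfig(cfg)
        ids := make(map[string]struct{})
        paginator := organizations.NewListAccountsPaginator(client, &organizations.ListAccountsInput{})
        for paginator.HasMorePages() {
            page, err := paginator.NextPage(ctx)
            if err != nil {
                return nil, err
            }
            for _, acct := range page.Accounts {
                ids[aws.ToString(acct.Id)] = struct{}{}
            }
        }
        return ids, nil
    }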

To avoid consistency issues with caching the accounts locally, we plan to periodically overwrite the cache, at least every hour. Since creating a whole new account isn't a frequent operation, an hour should suffice, but it can be raised via org_account_map_ttl.
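As a minimal sketch of that refresh strategy, assuming a helper like listOrgAccountIDs from the sketch above; the type and field names here are placeholders, not SPIRE code:

    import (
        "context"
        "sync"
        "time"
    )

    // orgAccountCache holds the organization's account IDs and rebuilds the set
    // once the configured TTL has elapsed (illustrative sketch only).
    type orgAccountCache struct {
        mu          sync.Mutex
        accounts    map[string]struct{}
        lastRefresh time.Time
        ttl         time.Duration // e.g. time.Hour, or org_account_map_ttl from the config
    }

    // isMember refreshes the cached account set when the TTL has expired and
    // then reports whether accountID belongs to the organization.
    func (c *orgAccountCache) isMember(ctx context.Context, accountID string,
        refresh func(context.Context) (map[string]struct{}, error)) (bool, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.accounts == nil || time.Since(c.lastRefresh) >= c.ttl {
            accounts, err := refresh(ctx) // e.g. listOrgAccountIDs from above
            if err != nil {
                return false, err
            }
            c.accounts = accounts
            c.lastRefresh = time.Now()
        }
        _, ok := c.accounts[accountID]
        return ok, nil
    }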

We also looked at the DescribeAccount API, but since we would need to make that call for every node attestation request, we thought it would be an easy target for DDoS. Also, if there are many nodes, we might end up unable to attest them due to throttling, since a single account would be making all the DescribeAccount calls.

Looking forward to hearing your thoughts about this feature.

Cheers.

PS: I am more than happy to work on this feature :)

rushi47 commented 9 months ago

Hello @evan2645, hope you are doing well. Just wanted to ping you and hear your thoughts. Our team is quite keen on this feature and we are already working on it in our private fork.

evan2645 commented 8 months ago

Hi @rushi47 thanks for your patience, and for the PR to reference. As I mentioned on Slack, I feel this is a great addition, and have been thinking through how to best expose it.

We currently have an assume_role configurable that allows you to provide a role name which SPIRE will try to assume in the target account in order to perform server-side node attestation. It would be nice to be able to re-use this. For example, a config require_org_account_id, which in turn requires assume_role to be set, and together we can form the correct ARN of the role to assume.

Why do we need region to be set? Is the account API not global? Can we hardcode it or have a sane default?

I think that would be a decent experience; then we would need to document the behavior and also the IAM policy required to make this work. The next issue is the implementation and working around rate limits ...

I worry about a configurable cache TTL, and that there will never be a right answer. Some people won't want to wait an hour for attestations to succeed after creating an account (I know I wouldn't). What are the rate limits on the two APIs you mentioned? I feel this part needs a little more thinking.

rushi47 commented 8 months ago

Thank you @evan2645 for your detailed response. Below are my thoughts:

We currently have an assume_role configurable that allows you to provide a role name which SPIRE will try to assume in the target account in order to perform server-side node attestation. It would be nice to be able to re-use this. For example, a config require_org_account_id, which in turn requires assume_role to be set, and together we can form the correct ARN of the role to assume.

That's right. After reading the docs, my understanding is that we need to create a role in the root account and attach a policy to it allowing ListAccounts / DescribeAccount, and we will need to add a trust relationship between the accounts running SPIRE servers and the root account. We were thinking of two possibilities:

Maybe we can refactor and design it so that if a role name is configured, we use it; if not, we fall back to the existing role used for node attestation. What are your thoughts on that?

Why do we need region to be set? Is the account API not global? Can we hardcode it or have a sane default?

Organizations appears to have endpoints in different regions as well, per the SDK documentation. I also didn't reuse the existing region because the SPIRE servers and their organization might be running in different regions, per the afore-mentioned doc. But please let me know your thoughts here: would you suggest using the existing configured region, or another one? I have recently started working with the AWS API, so I'm not super familiar with the patterns.

One more reason to configure the region explicitly was to avoid a lot of logic changes to the existing code base. Currently, in getClient, where we cache the clients, we use the region as the cache key, but this should be easy to change if passing a region for AWS Organizations is not a problem.

Those were my thoughts on setting the region explicitly; I totally agree with having defaults here.

I worry about a configurable cache TTL, and that there will never be a right answer. Some people won't want to wait an hour for attestations to succeed after creating an account (I know I wouldn't).

Totally agree, and we were trying to avoid this TTL altogether. One of the thoughts behind using an hour was to question how frequently a brand-new account is actually created. And even if it is frequent, going from creating the account to spinning up SPIRE should take at least an hour (creating roles/policies, VPCs, etc.), so we thought this could work out. But like you said, this assumption might not hold for the broader community.

What are the rate limits on the two APIs you mentioned? I feel this part needs a little more thinking.

I searched for this and couldn't find rate limits for the Organizations API, which makes me more skeptical about making a call to AWS on each attestation request: there might be N nodes arriving for attestation at the same instant, and we would run out of limits. That's why we thought ListAccounts with caching would be a good option. Like the EC2 one, this page from the Organizations docs doesn't define any limits.

I think, with your suggestions implemented, the config would look something like this:

      account_ids_belong_to_org_validation = {
        org_account_id     = "7891011"
      }

assuming we can set defaults for the params below:

        org_account_role   = "spire-server-org-role" <---- use same role for node attestation 
        org_account_region = "us-west-2" <---- default region we agree on
        org_account_map_ttl = "6" <--- default ttl if we go with existing route 
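A hypothetical sketch of how those fallbacks could be applied when the optional fields are left unset; the struct, field names, and default values below are placeholders from this discussion, not a final config surface:

    import "time"

    // orgValidationConfig and applyOrgValidationDefaults are illustrative only.
    type orgValidationConfig struct {
        AssumeRole       string        // existing assume_role setting
        OrgAccountRole   string        // org_account_role
        OrgAccountRegion string        // org_account_region
        OrgAccountMapTTL time.Duration // org_account_map_ttl
    }

    func applyOrgValidationDefaults(c *orgValidationConfig) {
        if c.OrgAccountRole == "" {
            c.OrgAccountRole = c.AssumeRole // reuse the existing node-attestation role
        }
        if c.OrgAccountRegion == "" {
            c.OrgAccountRegion = "us-west-2" // whichever default region is agreed on
        }
        if c.OrgAccountMapTTL == 0 {
            c.OrgAccountMapTTL = 6 * time.Hour // assumed unit; only if the TTL route is kept
        }
    }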

cc: tagging my teammates here in case they want to add anything to the above response: @achaurasiaConfluent @coatestiger @greghensley

achaurasiaConfluent commented 8 months ago

We currently have an assume_role configurable that allows you to provide a role name which SPIRE will try to assume in the target account in order to perform server-side node attestation.

The role referenced by account_ids_belong_to_org_validation.org_account_role does not live in every target account; it exists only in the org (management) account. AWS recommends that this root org account should not have any compute or storage in it.

AWS region: I think we could re-use the region; AFAIK, IAM is available globally.

evan2645 commented 7 months ago

Hi @rushi47 thank you for your patience here, I was traveling most of last week.

Just wanted to write a summary of what we discussed on the previous call:

evan2645 commented 7 months ago

IMO the big blockers here are figured out ... the shape of config is important but less fundamental, we can discuss it further in review. I'll remove the unscoped label

rushi47 commented 7 months ago

Thanks @evan2645 for writing this out. I will try to refactor the code and make the required changes to incorporate the above requirements.

rushi47 commented 7 months ago

Hey @evan2645, we were discussing this feature internally, and we have two approaches:

  1. We have a fixed counter X, and within the TTL period we allow a cache burst (refresh) at most X times. So if a request arrives for an account that is not in the cache, it is checked against the AWS API, unless the X limit is exhausted, which I think is expected. In this case a client could exhaust the limit at the start of the window, and all subsequent requests would fail even if they are genuine (see the sketch after this list).

  2. There was one more interesting approach, suggested by @greghensley, of having two TTLs.

First TTL, the main TTL: this is what the operator will configure, and the cache is refreshed once it expires. This is what we use in approach 1.

Second TTL, let's call it neg-ttl: instead of X retries, we specify a very short time period, like 10s, replacing that X. We allow a cache burst once that neg-ttl has elapsed, and we keep doing so over the period of the main TTL.

In this case the cache refresh is quite smart, but a request might still fail if it lands within those 10s.
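To make approach 1 concrete, here is a rough sketch as a variant of the earlier TTL cache (same assumed context, sync, and time imports): a cache miss may trigger an immediate refresh, but at most X times per main-TTL window, on top of the normal refresh when the TTL expires. All names and values are placeholders, not the eventual implementation:

    // missBudgetCache allows a cache miss to trigger a refresh at most
    // maxMissRefreshes (X) times per TTL window (illustrative sketch only).
    type missBudgetCache struct {
        mu               sync.Mutex
        accounts         map[string]struct{}
        lastRefresh      time.Time
        ttl              time.Duration // main TTL, e.g. a few minutes
        missRefreshes    int           // refreshes already spent in this window
        maxMissRefreshes int           // X, e.g. 1 or 2
    }

    func (c *missBudgetCache) isMember(ctx context.Context, accountID string,
        refresh func(context.Context) (map[string]struct{}, error)) (bool, error) {
        c.mu.Lock()
        defer c.mu.Unlock()

        // Scheduled refresh: rebuild the cache and reset the miss budget when
        // the main TTL expires (or on first use).
        if c.accounts == nil || time.Since(c.lastRefresh) >= c.ttl {
            if err := c.refreshLocked(ctx, refresh); err != nil {
                return false, err
            }
            c.missRefreshes = 0
        }

        if _, ok := c.accounts[accountID]; ok {
            return true, nil
        }

        // Cache miss: spend one unit of the X budget, if any is left, and
        // re-check against the freshly fetched account list.
        if c.missRefreshes < c.maxMissRefreshes {
            c.missRefreshes++
            if err := c.refreshLocked(ctx, refresh); err != nil {
                return false, err
            }
            _, ok := c.accounts[accountID]
            return ok, nil
        }
        return false, nil
    }

    func (c *missBudgetCache) refreshLocked(ctx context.Context,
        refresh func(context.Context) (map[string]struct{}, error)) error {
        accounts, err := refresh(ctx)
        if err != nil {
            return err
        }
        c.accounts = accounts
        c.lastRefresh = time.Now()
        return nil
    }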

Wanted to hear your thoughts on this before I open a PR and add this feature.

evan2645 commented 7 months ago

I think these approaches can be summarized as X tries over ttl period of time, or 1 try over neg-ttl period of time? And I guess for the first one, there is a possibility that X == 1? For me, the difference is nuanced and I'm not sure that one over the other has an outsized impact on the experience.

I think there's an extra detail about option 1 that's not currently captured, and that is the rate of account creation or total number of accounts created over ttl period. Successful retrieval after cache miss should update the cache, so e.g. turning on a new fleet in a new account should trigger at most one cache miss call per SPIRE Server. The value of X only comes into play when you turn on more than 1 account in ttl period.

If we are worried about DoS, another consideration is crash loop. An agent in an account that's not part of your org, or a malicious actor, can slowly drain the bucket by continually retrying against an account that will never succeed. So some level of predictability on the upper limit of such a pattern is also desirable. I think the first approach has a slight strength in this regard (e.g. if X == 1, your upper limit is 2x ttl .. compared to approach two, where the interaction between ttl and neg-ttl is more complex).

My gut feeling is that the refresh period (referenced as ttl above) should be somewhere between 1-5 mins, and (if taking approach one) X should be 1 or 2. Harder for me to say what a reasonable neg-ttl value would be if that's the route we took. Maybe 1 minute? Perhaps the main effect of neg-ttl is to spread out X when neg-ttl < ttl 🤔

Lastly, we'll be picking the 1.9.2 release candidate commit in one week from today. It may take a few days to turn reviews around, especially if changes are needed. Either approach seems fine to me, with a slight preference for the first due to simplicity... will probably be easier to implement and review, easier to predict behavior, etc.

rushi47 commented 7 months ago

Thanks @evan2645 for the detailed comment. Owing to the behavior we are expecting, we went with the first approach. Currently I have configured the retries to a higher number, but we can easily change it.

Lastly, we'll be picking the 1.9.2 release candidate commit in one week from today.

Owing to the deadline, I have opened up a PR. I can see it's showing a git conflict in go.mod; I will resolve it soon, but meanwhile feel free to take a look.