Define a usage plan and API key for the Confluence workflow

nikki-t commented 3 months ago

Modify Hydrocron Terraform to define usage plan parameters:

Limit (requests per month)
Burst limit (concurrent requests)
Rate limit (requests per second)

Decide on how we want to handle API keys:

Should we define one API key per use case? I believe this will means the usage plan limits will apply to each key separately and are not aggregated over all API keys in a usage plan.
Or should a single API key be used?

@frankinspace - How did you envision using the keys?

nikki-t commented 3 months ago

Requested limits are:

Limit: 11,107,815 requests per month + additional 1,000,000 for testing
Burst limit: 3,000
Rate limit: 6,000

nikki-t commented 2 months ago

@torimcd and @frankinspace - I forgot to mention that I would also like to tackle this issue during the current sprint if possible, I don't think it should be too much work.

Do we want to provide each trusted partner with their own API keys so we can isolate and track their usage against a usage plan?

Should we start with these defaults for the usage plan (based on the expected Confluence usage):

Limit: 11,107,815 requests per month + additional 1,000,000 for testing
Burst limit: 3,000
Rate limit: 6,000

nikki-t commented 2 months ago

We decided to use a single usage plan starting with the following limits:

Limit: 11,107,815 requests per month + additional 1,000,000 for testing
Burst limit: 3,000
Rate limit: 6,000

We will implement one API key per trusted partner to help track and isolate usage.

nikki-t commented 2 months ago

I think I have the limit and API keys almost implemented but ran into a snag that set me back a little bit.

Previously I developed the code under the assumption that we would use a single API key for trusted partners but since then we have decided to use one key per user and need a way to store a list of keys. This required some changes:

Modified the API Gateway Terraform to add a new key and store that key in a list which gets encoded as a string and stored in the trusted partner SSM parameter.
Modified the Lambda authorizer code to read in the trusted key list and determine if the key passed in the x-hydrocron-key header is in the trusted key list in order to create a policy that will cause the current request to count against the usage plan. If the key is not in the header or the header does not exist then the default usage plan is used so previous functionality is not impacted.

This has caused a little bit of an issue with how the Lambda authorizer is written. The SSM parameters are written such that the values are retrieved during the Lambda INIT phase or during static initialization. This means that any changes made to the parameter are not immediately detected by the Lambda as the value would only be pulled when a new Lambda container is created and executed.

I tried to re-deploy the Lambda authorize using the replace_triggered_by Terraform lifecycle meta-argument but it produces an error saying the resource already exists. I have tried various different parameters and always end up with the same result.

So I am thinking we will want to move the retrieval of the trusted key out of static initialization so we can detect changes in the SSM parameter and use newly created keys. Or we can set an environment variable that when set causes the authorizer to query the parameter which we can manually set to another value after the initial execution of the authorizer.

I don't really like either option but can't think of any thing else. @frankinspace - Do you have any thoughts on this or maybe a different way to handle the API keys?

frankinspace commented 2 months ago

The other option would be to load the api key list as part of the terraform deployment. We could have the terraform accept the list as a variable and modify the deploy step in the build pipeline to pass the list in from an repo environment secret. Changing the list would then require redeployment but we should be able to do that with a workflow_dispatch run of the build (which can specify whatever tag version is already in ops).

frankinspace commented 2 months ago

After discussion we decided it made sense to move away from being able to dynamically add/remove trusted API keys for a few reasons:

There is other questions/documentation that needs to be captured when new API keys are created and distributed. For example, what is the expected usage level? How much will that affect the cost of the API?
It is important to have an audit trail for when these decisions are made
There is a way to tie it in to our existing release process so that adding and removing can still be done quickly

Therefore, instead of looking up the list of trusted API keys from SSM, we will instead transition to passing in the list of API keys as a secure environment variable passed in to the lambda function. The API keys will be created/defined in the terraform scripts which will require an issue & PR in order to change (which will act as the audit trail for adding or removing valid keys). Addition or removal of a trusted API key will be implemented and deployed as a patch release which will allow us to bypass other changes in development in order to quickly deploy to UAT followed by OPS.

Work remaining for this ticket:

remove ssm lookup for api key
transition to passing in api key values as env var
mark api key as sensitive in terraform
encrypt the environment variable in transit and at rest https://github.com/podaac/hydrocron/issues/174

Nice to have, potentially as distinct feature:

Create request template for requesting new API keys: https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository

podaac / hydrocron

Define a usage plan and API key for the Confluence workflow #205