microsoft / azure-pipelines-terraform

Azure Pipelines tasks for installing Terraform and running Terraform commands in a build or release pipeline.
MIT License
95 stars 59 forks source link

enhancement: update to use env vars for WIF backend #214

Closed jaredfholgate closed 1 week ago

jaredfholgate commented 4 months ago

This is to resolve the issues some users are seeing with WIF auth.

E.g. #201

When supplying the WIF OIDC token to the backend via CLI args it is cached in the plan file or cached in the config.

As such, subsequent commands that have long delays between them result in a timeout for anything accessing the state file.

This PR moves those settings to env vars to remove this problem.

However this does break the current behaviour where we allow a separate service connection for backend and plan / apply. We can now only support one set of credentials for accessing the state file during plan and apply. This is due to a limitation of the azurerm backend, which does not support using different env vars or the az cli at present. If the upstream functionality is introduced, then this can be re-introduced to the task later.

mericstam commented 4 months ago

Hi Jared, Could there be a scenario when a backendSPN has a different authentication scheme to the environmentSPN and then we get into trouble with setting env vars? or will this fix only be for WIF and we would not consider mixing authentication schemes?

jaredfholgate commented 4 months ago

Hi Jared, Could there be a scenario when a backendSPN has a different authentication scheme to the environmentSPN and then we get into trouble with setting env vars? or will this fix only be for WIF and we would not consider mixing authentication schemes?

This fix is only for WIF, other auth methods will not be impacted. In the scenario that someone uses Client Secret for backend and WIF for plan / apply, for example, that should still work if it did before. I have not ever tested that scenario of using a different auth type or a different service connection for init and plan / apply. It is not well supported by the Terraform CLI since there is overlap with environment variable names, etc.

I am unsure if we should even support a separate service connection for the backend. The CLI only really supports that by fluke by nature of it persisting the credentials if you supply them via CLI args. There is no way to separate them if using env vars. Supplying the creds via the CLI args is also insecure as they get persisted to the plan file.

If I was approaching this now given the current state of the provider and backend auth support, I would only have a single service connection input. I think if the backend in the CLI gets better support for separation in a secure way then I would support it again. I really want to submit a PR to Terraform Core for this at some point, but it is big job so need to find time.

In summary, this should have no impact on the edge cases. The only impact is on WIF, which can't support two different identities as it stands and still work with any kind of delay due to the provide and core limitations on the Terraform side.

jaredfholgate commented 4 months ago

@mericstam Please note all the commits here were me trying to add an Action for running the unit tests. Kept hitting weird npm issues, so took a bit of figuring out. Got them running now though, so should ensure that all unit tests pass on a PR going forward.

mericstam commented 4 months ago

Hi Jared, Could there be a scenario when a backendSPN has a different authentication scheme to the environmentSPN and then we get into trouble with setting env vars? or will this fix only be for WIF and we would not consider mixing authentication schemes?

This fix is only for WIF, other auth methods will not be impacted. In the scenario that someone uses Client Secret for backend and WIF for plan / apply, for example, that should still work if it did before. I have not ever tested that scenario of using a different auth type or a different service connection for init and plan / apply. It is not well supported by the Terraform CLI since there is overlap with environment variable names, etc.

I am unsure if we should even support a separate service connection for the backend. The CLI only really supports that by fluke by nature of it persisting the credentials if you supply them via CLI args. There is no way to separate them if using env vars. Supplying the creds via the CLI args is also insecure as they get persisted to the plan file.

If I was approaching this now given the current state of the provider and backend auth support, I would only have a single service connection input. I think if the backend in the CLI gets better support for separation in a secure way then I would support it again. I really want to submit a PR to Terraform Core for this at some point, but it is big job so need to find time.

In summary, this should have no impact on the edge cases. The only impact is on WIF, which can't support two different identities as it stands and still work with any kind of delay due to the provide and core limitations on the Terraform side.

@jaredfholgate, Thanks for clarifying. I have tested a bit with different auth methods for backend and environment service connections but it doesn’t work well. Even having two service connections with the same tenant and different subscriptions doesn’t work well, there are some issues reported . So yes I agree the functionality shouldn’t have been there in first place. However I think a lot of users like that feature even though it only works by accident.

mericstam commented 4 months ago

@mericstam Please note all the commits here were me trying to add an Action for running the unit tests. Kept hitting weird npm issues, so took a bit of figuring out. Got them running now though, so should ensure that all unit tests pass on a PR going forward.

Got it. great you fixing running the unittests in pr workflow!

jaredfholgate commented 3 months ago

@mericstam Thanks for the version reminder. You have told me this before and I totally forgot! :) Would it be possible to get this pushed to dev / test soon?

mericstam commented 3 months ago

@jaredfholgate, had some issues with the pipeline project. Got that sorted out. Building your branch to dev now!

mericstam commented 3 months ago

hey @jaredfholgate, please let me know if you have access to the dev version of the extension, or should I push to test version?

jaredfholgate commented 2 months ago

hey @jaredfholgate, please let me know if you have access to the dev version of the extension, or should I push to test version?

Hi @mericstam. Apologies for the delay, but just got to testing this. I am having an issue with the dev instance of the task. It appears that the extension (zip) is empty.

mericstam commented 2 months ago

@jaredfholgate . Looks good to me. did you mean extension zip is totally empty or just missing your latest changes? version should be 0.1.95

mericstam commented 2 months ago

@jaredfholgate. Also pushed version 0.1.35 to test feed

jaredfholgate commented 2 months ago

@jaredfholgate. Also pushed version 0.1.35 to test feed

@mericstam Looks like I was downloading dev and that is the empty zip. But there is still a problem with test. I am seeing this error message: ##[warning]Failed to download task 'TerraformTaskV4'. Error No task definition found matching ID fe504acc-6115-40cb-89ff-191386b5e7bf and version 4.236.35. You must register the task definition before uploading the package.

I think this was due to a discrepancy between task.json and task.loc.json (my fault). I have pushed a new commit using the latest sprint version, which will hopefully fix it.

Sorry for my confusion and messing it up. I'm aiming for a record for the longest running PR in history. :)

mericstam commented 2 months ago

@jaredfholgate re-built and deployed new versions to dev and test

jaredfholgate commented 2 months ago

@mericstam I have finally got round to testing. It all looks good with the exception of the warnIfMultipleProviders() method. That runs a terraform providers command is dependent on credentials supplied during the init. I have re-ordered it to move it after the credentials are handled, which will hopefully remove the warning.

I'm not really sure what the purpose of that method is to be honest with you, it seems a bit like we are getting more involved in the quality of the customers code than we need to be and it only checks for the azurerm provider, not azapi or azuread for example. I would suggest removing that method altogether unless you know a good reason it is present?

jaredfholgate commented 2 months ago

@mericstam Just a ping to see if we can get an updated release and get this one closed off. Thanks

mericstam commented 2 months ago

@jaredfholgate New versions pushed to Dev and Test : )

jaredfholgate commented 2 months ago

@mericstam I have tested this now and can confirm the terraform providers error has been resolved. We can go ahead and release this now if you are happy with the changes.

andrewdmay commented 2 months ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

jaredfholgate commented 2 months ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

mericstam commented 2 months ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up. Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions. Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

I would prefer not revert changes if possible. I am not sure a V5 is the best approach either, that means we need to support both V4 and V5 going forward, like support for node v20 that is around the corner. That would mean we are splitting the codebase and then something like v4.1 (or v6) and a v5.1(or v7) for node 20 support. I think the best way is to add a toggle to enable/disable the changes, or what do you think @jaredfholgate ?

jaredfholgate commented 2 months ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

I would prefer not revert changes if possible. I am not sure a V5 is the best approach either, that means we need to support both V4 and V5 going forward, like support for node v20 that is around the corner. That would mean we are splitting the codebase and then something like v4.1 (or v6) and a v5.1(or v7) for node 20 support. I think the best way is to add a toggle to enable/disable the changes, or what do you think @jaredfholgate ?

Yes, I agree the flag is a better approach. I will look into making an update. I'm thinking we should support env vars for client secret anyway, so I will see if I can make it a bit more generic and agnostic of auth type. I think we'll need to default to the new behaviour for WIF given the issues reported, but default to existing behaviour for other auth types. I will be careful to ensure changes are backwards compatible whatever the case.

I am also reaching out to the product group around proper support for multiple identities in the azurerm backend. And also azure cli support. This will obviously come later, but will consider this may come for any changes here.

jaredfholgate commented 2 months ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

I would prefer not revert changes if possible. I am not sure a V5 is the best approach either, that means we need to support both V4 and V5 going forward, like support for node v20 that is around the corner. That would mean we are splitting the codebase and then something like v4.1 (or v6) and a v5.1(or v7) for node 20 support. I think the best way is to add a toggle to enable/disable the changes, or what do you think @jaredfholgate ?

Yes, I agree the flag is a better approach. I will look into making an update. I'm thinking we should support env vars for client secret anyway, so I will see if I can make it a bit more generic and agnostic of auth type. I think we'll need to default to the new behaviour for WIF given the issues reported, but default to existing behaviour for other auth types. I will be careful to ensure changes are backwards compatible whatever the case.

I am also reaching out to the product group around proper support for multiple identities in the azurerm backend. And also azure cli support. This will obviously come later, but will consider this may come for any changes here.

@mericstam I have made the updates as discussed and added some unit tests. I took the opportunity to add the Azure AD auth setting too.

Both new settings default to existing behaviour when not specified, so this should not be a breaking change for users.

However, when used via the GUI in the pipeline editor, they will both default to true in order to drive new users in that direction.

I have tested by publishing to my own copy, but can re-test in dev too.

Please note I have also deleted the task.loc.json file as this file seems like an unnecessary overhead to maintain and is not required to publish a task. However, if there is a reason you have it, then I can add it back in and make the updates manually.

mericstam commented 2 months ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

I would prefer not revert changes if possible. I am not sure a V5 is the best approach either, that means we need to support both V4 and V5 going forward, like support for node v20 that is around the corner. That would mean we are splitting the codebase and then something like v4.1 (or v6) and a v5.1(or v7) for node 20 support. I think the best way is to add a toggle to enable/disable the changes, or what do you think @jaredfholgate ?

Yes, I agree the flag is a better approach. I will look into making an update. I'm thinking we should support env vars for client secret anyway, so I will see if I can make it a bit more generic and agnostic of auth type. I think we'll need to default to the new behaviour for WIF given the issues reported, but default to existing behaviour for other auth types. I will be careful to ensure changes are backwards compatible whatever the case. I am also reaching out to the product group around proper support for multiple identities in the azurerm backend. And also azure cli support. This will obviously come later, but will consider this may come for any changes here.

@mericstam I have made the updates as discussed and added some unit tests. I took the opportunity to add the Azure AD auth setting too.

Both new settings default to existing behaviour when not specified, so this should not be a breaking change for users.

However, when used via the GUI in the pipeline editor, they will both default to true in order to drive new users in that direction.

I have tested by publishing to my own copy, but can re-test in dev too.

Please note I have also deleted the task.loc.json file as this file seems like an unnecessary overhead to maintain and is not required to publish a task. However, if there is a reason you have it, then I can add it back in and make the updates manually.

Yes, correct me if I am wrong but *.loc.json is used for langauage for localization? I think it is not needed as we only have english language support anyway.

jaredfholgate commented 1 month ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

I would prefer not revert changes if possible. I am not sure a V5 is the best approach either, that means we need to support both V4 and V5 going forward, like support for node v20 that is around the corner. That would mean we are splitting the codebase and then something like v4.1 (or v6) and a v5.1(or v7) for node 20 support. I think the best way is to add a toggle to enable/disable the changes, or what do you think @jaredfholgate ?

Yes, I agree the flag is a better approach. I will look into making an update. I'm thinking we should support env vars for client secret anyway, so I will see if I can make it a bit more generic and agnostic of auth type. I think we'll need to default to the new behaviour for WIF given the issues reported, but default to existing behaviour for other auth types. I will be careful to ensure changes are backwards compatible whatever the case. I am also reaching out to the product group around proper support for multiple identities in the azurerm backend. And also azure cli support. This will obviously come later, but will consider this may come for any changes here.

@mericstam I have made the updates as discussed and added some unit tests. I took the opportunity to add the Azure AD auth setting too. Both new settings default to existing behaviour when not specified, so this should not be a breaking change for users. However, when used via the GUI in the pipeline editor, they will both default to true in order to drive new users in that direction. I have tested by publishing to my own copy, but can re-test in dev too. Please note I have also deleted the task.loc.json file as this file seems like an unnecessary overhead to maintain and is not required to publish a task. However, if there is a reason you have it, then I can add it back in and make the updates manually.

Yes, correct me if I am wrong but *.loc.json is used for langauage for localization? I think it is not needed as we only have english language support anyway.

@mericstam Any luck with getting this deployed to test / prod? Thanks

mericstam commented 1 month ago

I saw the discussion here - we're currently making use of the separate WIF service connections for backend and environment, so it looks like this change would break that set-up.

Not having things timeout after 10 minutes would be a big improvement, but this is a breaking change for us and we'd have to reorganize/rethink our service connections and their permissions.

Has there been any thought of making a new version and having this change there rather than in V4?

Hi. I guess we could do that or we could add a parameter to revert to the previous behaviour. Unfortunately, as discussed here. Although this task supports two service connections, the reality is that Terraform CLI and the providers are not built to support multiple identities in a secure way (i.e. with env vars). It is more luck than design that it works for this task in my view.

The ultimate solution to support this is to update the Terraform CLI to support alternative env vars for backend and / or Azure CLI.

I would prefer not revert changes if possible. I am not sure a V5 is the best approach either, that means we need to support both V4 and V5 going forward, like support for node v20 that is around the corner. That would mean we are splitting the codebase and then something like v4.1 (or v6) and a v5.1(or v7) for node 20 support. I think the best way is to add a toggle to enable/disable the changes, or what do you think @jaredfholgate ?

Yes, I agree the flag is a better approach. I will look into making an update. I'm thinking we should support env vars for client secret anyway, so I will see if I can make it a bit more generic and agnostic of auth type. I think we'll need to default to the new behaviour for WIF given the issues reported, but default to existing behaviour for other auth types. I will be careful to ensure changes are backwards compatible whatever the case. I am also reaching out to the product group around proper support for multiple identities in the azurerm backend. And also azure cli support. This will obviously come later, but will consider this may come for any changes here.

@mericstam I have made the updates as discussed and added some unit tests. I took the opportunity to add the Azure AD auth setting too. Both new settings default to existing behaviour when not specified, so this should not be a breaking change for users. However, when used via the GUI in the pipeline editor, they will both default to true in order to drive new users in that direction. I have tested by publishing to my own copy, but can re-test in dev too. Please note I have also deleted the task.loc.json file as this file seems like an unnecessary overhead to maintain and is not required to publish a task. However, if there is a reason you have it, then I can add it back in and make the updates manually.

Yes, correct me if I am wrong but *.loc.json is used for langauage for localization? I think it is not needed as we only have english language support anyway.

@mericstam Any luck with getting this deployed to test / prod? Thanks

Hi unfortunatly not but we have a way forward. I will work on this tonight.

robbert-nlo commented 1 month ago

Hello @mericstam,

Do you have any idea when this PR will be merged and an updated extension will be published?

It's currently blocking our WIF deployment for quite some time. I was hoping to avoid building a custom version of the extension...

Thanks

mericstam commented 3 weeks ago

Hi @jaredfholgate . Finally got a new ado build backend working with the correct authentication. updated this branch with new location. dev and test version deployed from this branch.

mericstam commented 2 weeks ago

version 0.1.26 deployed to prod feed

robbert-nlo commented 1 week ago

Thanks @mericstam and @jaredfholgate ! The workaround in the new version works.

One comment though: shouldn't this PR be merged now that we have a PRD release with it included? It appears that the main branch and release are now out of sync.

mericstam commented 1 week ago

Hi @robbert-nlo,

We try to use standard github flow as much as possible. we deploy from feature branch , and when all looks good, after a couple of days we merge. The reason for this is we can keep the previous version unaltered in main as a way to rollback.