hashicorp / vault-plugin-secrets-azure

Vault Azure Secrets plugin

Some generated creds not being accepted by the Azure API #23

Closed MMollyy closed 5 years ago

MMollyy commented 5 years ago

Hi all!

Not sure what information you might need: we are running Vault version 1.0.3 and are using the engine to generate creds on existing applications across four different tenants/subscriptions. Vault itself is able to generate the creds.

But we are experiencing issues with some of the creds generated on those applications being invalid/unusable. The creds are generated for an application used by HashiCorp Terraform, and the values are placed into the environment: https://www.terraform.io/docs/providers/azurerm/auth/service_principal_client_secret.html
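
For context, the values end up in the standard azurerm provider environment variables from that page (the values shown here are placeholders):

export ARM_CLIENT_ID=<client_id from Vault>
export ARM_CLIENT_SECRET=<client_secret from Vault>
export ARM_TENANT_ID=<tenant_id>
export ARM_SUBSCRIPTION_ID=<subscription_id>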

The generated creds fail about 3-4 times out of 10, and we're using them in our CI/CD pipelines, which makes the pipelines very unreliable since we are unable to pinpoint what the issue even is. Microsoft support tells me that the client_secret isn't URL-encoded, which I've also raised a question about with the Terraform team: https://github.com/terraform-providers/terraform-provider-azurerm/issues/3237

I've tested the creds manually as well using Postman, and the result coming from Postman is:

{
    "error": "invalid_client",
    "error_description": "AADSTS7000215: Invalid client secret is provided.\r\n
    Trace ID: 92af1d95-8143-4972-8fb5-2ff0422d5400\r\n
    Correlation ID: 78c036fe-89c8-4993-a1da-f6d3454174d6\r\n
    Timestamp: 2019-04-15 11:59:17Z",
    "error_codes": [
        7000215
    ],
    "timestamp": "2019-04-15 11:59:17Z",
    "trace_id": "92af1d95-8143-4972-8fb5-2ff0422d5400",
    "correlation_id": "78c036fe-89c8-4993-a1da-f6d3454174d6"
}

So it is in fact the secret that is not valid; I verified this by testing the creds from a successful run in Postman as well. If the secret were valid and Terraform still failed, I would definitely expect this to be a Terraform issue, but that is not the case. Hence I am creating a ticket here in parallel, as I feel the secrets engine might be generating secrets incorrectly.

kalafut commented 5 years ago

Hi. Can you tell me what the TTL for the credentials is and roughly how many have been generated (e.g. az ad app show --id <application_id> | grep keyId | wc -l)? I ask because there are limits on the number of credentials for a single app, so it's useful to keep the TTL short. Also, what is the duration between when the credential is generated and when it is used? You may need to allow time for the change to be widely propagated within Azure.

MMollyy commented 5 years ago

There won't be any secrets to be found: we revoke each secret after it's been used. Thankfully there is a revoke functionality, and it works. :)

And we set the TTL to 1h, so I can generate new pipeline runs that might fail and we might then be able to see the secrets. But then we'd need to set up a call or something, because they will expire within 60m.

As for the duration between creation and usage: we use a piece of backoff code with incrementally increasing delays. Our pipeline checks whether the credential is usable; if not, it backs off and waits until it is. Output from our code doing the backoff:

2019-04-15 19:46:24 runner-9cc1bf12-project-1662-concurrent-0 adal-python[10] INFO 1f80693b-c69b-480c-92a6-ff925fb6d3c9 - TokenRequest:Getting token with client credentials.
2019-04-15 19:46:24 runner-9cc1bf12-project-1662-concurrent-0 backoff[10] INFO Backing off _get_service_principal_credentials(...) for 0.6s (msrest.exceptions.AuthenticationError: , AdalError: Get Token request returned http error: 401 and server response: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: 8a38d243-3367-42f9-b63c-aa4b16ee4d00\r\nCorrelation ID: 1f80693b-c69b-480c-92a6-ff925fb6d3c9\r\nTimestamp: 2019-04-15 19:46:24Z","error_codes":[7000215],"timestamp":"2019-04-15 19:46:24Z","trace_id":"8a38d243-3367-42f9-b63c-aa4b16ee4d00","correlation_id":"1f80693b-c69b-480c-92a6-ff925fb6d3c9"})
2019-04-15 19:46:25 runner-9cc1bf12-project-1662-concurrent-0 adal-python[10] INFO 9be30e10-fe2a-4881-a637-7487dec9d73d - TokenRequest:Getting token with client credentials.
2019-04-15 19:46:25 runner-9cc1bf12-project-1662-concurrent-0 backoff[10] INFO Backing off _get_service_principal_credentials(...) for 0.7s (msrest.exceptions.AuthenticationError: , AdalError: Get Token request returned http error: 401 and server response: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: 35b2a1e8-a1b2-4033-af53-903d98374e00\r\nCorrelation ID: 9be30e10-fe2a-4881-a637-7487dec9d73d\r\nTimestamp: 2019-04-15 19:46:25Z","error_codes":[7000215],"timestamp":"2019-04-15 19:46:25Z","trace_id":"35b2a1e8-a1b2-4033-af53-903d98374e00","correlation_id":"9be30e10-fe2a-4881-a637-7487dec9d73d"})
2019-04-15 19:46:26 runner-9cc1bf12-project-1662-concurrent-0 adal-python[10] INFO a8a00eea-fe9f-4f6e-b637-7ab3b02e70ee - TokenRequest:Getting token with client credentials.
2019-04-15 19:46:26 runner-9cc1bf12-project-1662-concurrent-0 backoff[10] INFO Backing off _get_service_principal_credentials(...) for 3.8s (msrest.exceptions.AuthenticationError: , AdalError: Get Token request returned http error: 401 and server response: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: 09f5eecb-8d89-4282-a03c-d9b0bc9f4a00\r\nCorrelation ID: a8a00eea-fe9f-4f6e-b637-7ab3b02e70ee\r\nTimestamp: 2019-04-15 19:46:26Z","error_codes":[7000215],"timestamp":"2019-04-15 19:46:26Z","trace_id":"09f5eecb-8d89-4282-a03c-d9b0bc9f4a00","correlation_id":"a8a00eea-fe9f-4f6e-b637-7ab3b02e70ee"})
2019-04-15 19:46:30 runner-9cc1bf12-project-1662-concurrent-0 adal-python[10] INFO 1be1caf2-318b-46c2-bb37-504963acb1c4 - TokenRequest:Getting token with client credentials.
2019-04-15 19:46:30 runner-9cc1bf12-project-1662-concurrent-0 backoff[10] INFO Backing off _get_service_principal_credentials(...) for 4.9s (msrest.exceptions.AuthenticationError: , AdalError: Get Token request returned http error: 401 and server response: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: d730a927-3281-430c-ae10-7ea152e14600\r\nCorrelation ID: 1be1caf2-318b-46c2-bb37-504963acb1c4\r\nTimestamp: 2019-04-15 19:46:30Z","error_codes":[7000215],"timestamp":"2019-04-15 19:46:30Z","trace_id":"d730a927-3281-430c-ae10-7ea152e14600","correlation_id":"1be1caf2-318b-46c2-bb37-504963acb1c4"})
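
The logic is roughly equivalent to this shell sketch (illustrative only; our actual code is Python using the adal and backoff libraries, and the ARM_* variables hold the creds from Vault):

# illustrative shell equivalent of our Python backoff logic:
# retry the token request with an increasing delay until AAD accepts the secret
delay=1
until curl -sf https://login.microsoftonline.com/$ARM_TENANT_ID/oauth2/token \
      -d grant_type=client_credentials \
      -d client_id=$ARM_CLIENT_ID \
      -d client_secret=$ARM_CLIENT_SECRET \
      -d resource=https://management.azure.com/ > /dev/null; do
    echo "secret not usable yet, backing off for ${delay}s"
    sleep $delay
    delay=$((delay * 2))   # exponential backoff
done
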
kalafut commented 5 years ago

Hi. A couple of things to check. First, are you able to reproduce this at all reliably by making calls directly to Vault (e.g. with the CLI) and then testing the secrets? It would be good to know whether this is present only within the CI pipeline, and even if not, it's easier to troubleshoot manually.

Also, for one of the secrets that doesn't work and should still be active (not revoked), do you see a credential record in Azure (either via the portal or with the az command I mentioned before) with the same creation timestamp (+/- a second or so)? I'm interested in whether there is some way Vault is perceiving the credential create operation as a success while the credential isn't actually being added on the Azure side.
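
For example, something along these lines (a sketch; the role name, IDs, and mount path are placeholders for your setup):

# generate a credential directly from Vault
vault read azure/creds/<role_name>

# try the secret against AAD (v1 token endpoint, as used by ADAL/Terraform)
curl -s https://login.microsoftonline.com/<tenant_id>/oauth2/token \
    -d grant_type=client_credentials \
    -d client_id=<client_id> \
    -d client_secret=<client_secret> \
    -d resource=https://management.azure.com/

# check whether the credential record landed on the application
az ad app show --id <application_id> | grep keyId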

MMollyy commented 5 years ago

On it! I hope I'll be able to test it today, because yes, I can execute all of that manually and reliably. I will be back.

MMollyy commented 5 years ago

Alright, so I have been able to test this manually: execute the Vault creds, test the secret in Postman, and run the az command to see if the secret was created. For the few times it failed, this was the result:

Postman test: failed
Pipeline: failed (including the backoff code)
az command: secret exists
Portal: secret exists

I'm not sure how I would check the +/- a second or so difference, because:

vault read azure/test/creds/923-lifecycle-testing
Key                Value
---                -----
lease_id           azure/test/creds/923-lifecycle-testing/F9vVXJu3xNEY3jFlFyZOnWeN
lease_duration     1h
lease_renewable    true
client_id          IDHERE
client_secret      2c9c3bc0-70b4-80e2-d9a8-8bdcacaae987

There is no timestamp when Vault generates a secret.

 "passwordCredentials": [
    {
      "additionalProperties": null,
      "customKeyIdentifier": null,
      "endDate": "2029-04-16T09:23:09.835599+00:00",
      "keyId": "ffffff78-8428-21fa-6d9c-145e86025cd5",
      "startDate": "2019-04-19T09:23:09.835599+00:00",
      "value": null
    }
  ],

And that's the az output for it.

Edit: I did make sure to disable the revoke command, and I could verify there was only 1 secret on the application during each test. So it's guaranteed that Vault actually created the secret.

kalafut commented 5 years ago

Hmm, I'm not seeing anything obviously wrong. I plan to do some stress testing on this to see if I encounter errors. In the meantime, you may consider having the Azure secrets engine generate dynamic applications, if you're able to properly configure the roles (some resources cannot be set up that way, which is what prompted the addition of password generation for existing apps).
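
For reference, the two modes differ only in the role configuration (a sketch; role names, paths, and IDs are placeholders):

# existing app: Vault rotates passwords on the application you name
vault write azure/roles/<role_name> application_object_id=<app_object_id> ttl=1h

# dynamic: Vault creates a new app/service principal per lease, scoped via azure_roles
vault write azure/roles/<role_name> ttl=1h \
    azure_roles='[{"role_name": "Contributor", "scope": "/subscriptions/<subscription_id>/resourceGroups/<rg>"}]'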

As for the MS response about escaping the client secret, that just seems really odd. There is no variation in the format; they're all UUIDs. So if escaping were an issue, I'd expect none of them to work.

MMollyy commented 5 years ago

In the meantime, you may consider having the Azure secrets engine generate dynamic applications, ...

I wish that were an option. We need to assign multiple roles to the same service principal in order for its permissions to be complete across different resource groups. Otherwise applications don't have full functionality, and if we put everything in one role they have too many permissions on certain critical resources. The engine does not support assigning different roles to the same service principal. That might actually be a good feature.

MS has verified that the generated UUIDs are valid; I sent them some examples of failing creds. So it's definitely not the format.

kalafut commented 5 years ago

The engine does not support assigning different roles to the same service principal.

It does! The roles parameter is a list of roles that will be assigned: https://www.vaultproject.io/docs/secrets/azure/index.html#azure-roles
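
For example, to assign two roles scoped to different resource groups (subscription and resource group names are placeholders):

vault write azure/roles/<role_name> ttl=1h azure_roles=-<<EOF
[
  {
    "role_name": "Contributor",
    "scope": "/subscriptions/<subscription_id>/resourceGroups/<rg_one>"
  },
  {
    "role_name": "Reader",
    "scope": "/subscriptions/<subscription_id>/resourceGroups/<rg_two>"
  }
]
EOF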

MMollyy commented 5 years ago

Ah I see. Not sure how both my colleague and I missed that part.

I'll be putting some time into testing this functionality and seeing how I'm going to fit it into our pipeline code.

MMollyy commented 5 years ago

I like it!

This might make the rest of our code somewhat simpler as well :) Unfortunate that the other way doesn't work, but this one sure does! It creates the application/SP and secrets, and deletes everything with the revoke.

I'll close this issue with this. :)

kalafut commented 5 years ago

I'm glad it will work for you. It's the preferred approach if you're able to access everything solely through role assignments. That said, I do still plan to abuse the other method a bit to see if I can recreate the issue.
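
Roughly what I have in mind (an illustrative sketch; <role_name> would be a role configured with application_object_id, and <tenant_id> is a placeholder):

# hammer the existing-app path: generate creds repeatedly and test each secret immediately
for i in $(seq 1 100); do
    creds=$(vault read -format=json azure/creds/<role_name>)
    client_id=$(echo "$creds" | jq -r .data.client_id)
    client_secret=$(echo "$creds" | jq -r .data.client_secret)
    curl -s -o /dev/null -w "attempt $i: HTTP %{http_code}\n" \
        https://login.microsoftonline.com/<tenant_id>/oauth2/token \
        -d grant_type=client_credentials \
        -d "client_id=$client_id" \
        -d "client_secret=$client_secret" \
        -d resource=https://management.azure.com/
done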

MMollyy commented 5 years ago

@kalafut I do have a question though: Is the code used for creating the secrets the same in both dynamic and pre-created scenarios?

If you have any questions regarding our setup for this testing, I would be glad to provide info.

kalafut commented 5 years ago

It's the same. You'll get back a client_id and client_secret, but in the pre-created case the client_id will always be the same (the one you pre-created).

negnov commented 4 years ago

Hit the same issue. Switching to generating dynamic applications also helped.

mdgreenfield commented 4 years ago

@kalafut did you get a chance to look into this issue with existing service principals? I am also running into this scenario where sometimes the client secret returned is invalid.

In my scenario the service principal needs access to the AAD Graph for listing AAD groups, so AFAIK I must use an existing service principal in order to assign Directory.ReadWrite.All to it.

mdgreenfield commented 4 years ago

I was able to chat with an Azure rep this afternoon, and the issue is that the client credentials returned by Azure must be globally replicated before they are reliably usable. When requests using those credentials are authorized, the request is load balanced to any of their clusters/datacenters, so a client/user might get lucky, but the odds are against them.

I was able to reproduce the issue locally by using the credentials returned from Vault and issuing a curl request:

$ curl -s https://login.microsoftonline.com/<tenant_id>/oauth2/v2.0/token --data-urlencode "client_id=$client_id" --data-urlencode "client_secret=$client_secret" -d scope=https%3A%2F%2Fgraph.microsoft.com%2F.default -d grant_type=client_credentials -v

(you can see that I also tried checking with the --data-urlencode flag just in case)

Doing this, it appears that a request is rarely immediately successful. I found that it took 10-30s before the request succeeded, and even then I still noticed some responses returning a 401. Only after several minutes did I observe no more HTTP 401 responses.
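
A rough sketch of the polling I used to measure this (the same curl as above, in a loop; <tenant_id> is a placeholder):

# poll the token endpoint once per second, reporting elapsed time and status code
start=$(date +%s)
while true; do
    code=$(curl -s -o /dev/null -w "%{http_code}" \
        https://login.microsoftonline.com/<tenant_id>/oauth2/v2.0/token \
        --data-urlencode "client_id=$client_id" \
        --data-urlencode "client_secret=$client_secret" \
        -d scope=https%3A%2F%2Fgraph.microsoft.com%2F.default \
        -d grant_type=client_credentials)
    echo "$(( $(date +%s) - start ))s: HTTP $code"
    [ "$code" = "200" ] && break   # note: 401s can still recur until replication settles
    sleep 1
done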

alizdavoodi commented 4 years ago

The reasoning @mdgreenfield wrote out was precisely my problem. I put a "sleep" in the pipeline, and it solved the issue.
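
Something like this between generating the creds and first using them (illustrative; the role name is a placeholder):

vault read -format=json azure/creds/<role_name> > creds.json
sleep 30   # give AAD time to replicate the new secret before the first token request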