hashicorp / terraform-provider-vault

Terraform Vault provider
https://www.terraform.io/docs/providers/vault/
Mozilla Public License 2.0
459 stars 540 forks source link

auth/mount: fetch single vault resource on read #2145

Closed fairclothjm closed 7 months ago

fairclothjm commented 7 months ago

Description

This changes the READ operations in vault_auth_backend and vault_mount to use the GET /sys/auth/:path and GET /sys/mounts/:path APIs respectively. Previously, these resources were calling LIST which could result in a substantially higher CPU and memory footprint for the provider in cases where a given Vault server has a large number of secret/auth mounts.

At this time, there are no helpers for these API paths in the Vault api package. See https://github.com/hashicorp/vault/pull/25499 which proposes to add them.

TF Config used in the performance tests ``` terraform { required_providers { vault = { source = "hashicorp/vault" version = "~> 3.23.0" } } } provider "vault" { address = "http://localhost:8200" } variable "mount_name_count" { type = number default = 1000 } resource "vault_mount" "kvv2-example" { count = var.mount_name_count path = "kv-mount${count.index}" type = "kv-v2" options = { version = "2" type = "kv-v2" } description = "This is an example KV Version 2 secret engine mount" } resource "vault_auth_backend" "userpass-example" { count = var.mount_name_count type = "userpass" path = "userpass${count.index}" tune { max_lease_ttl = "90000s" listing_visibility = "unauth" } description = "This is an example userpass auth mount" } ```

Details on the CPU performance improvements

As expected, the CPU profile shows the biggest improvements in the Provider's READ operations which were spending most of the CPU time listing and decoding mount data.

pprof CPU profile flame graph

Before:

before-cpu

After:

after-cpu

Additionally, from the CPU profile we can see a big difference in the time spent in the call to mallocgc:

Before:

      flat  flat%   sum%        cum   cum%
     2.33s  2.94% 49.05%      8.15s 10.29%  runtime.mallocgc

After:

      flat  flat%   sum%        cum   cum%
     0.07s  0.79% 88.25%      0.27s  3.05%  runtime.mallocgc

So let's take a look at the pprof memory profile...

Details on the Memory performance improvements

Interestingly, the pprof memory profile does not shed any light on the performance improvements that we would predict based on the previous CPU profile analysis. However, both before and after point to the AWS SDK init functions as being very memory hungry.

pprof Memory profile

Before:

(pprof) top
Showing nodes accounting for 5680.34kB, 100% of 5680.34kB total
Showing top 10 nodes out of 33
      flat  flat%   sum%        cum   cum%
 1544.18kB 27.18% 27.18%  1544.18kB 27.18%  github.com/aws/aws-sdk-go/aws/endpoints.init
 1026.81kB 18.08% 45.26%  1026.81kB 18.08%  regexp.onePassCopy
  544.67kB  9.59% 54.85%   544.67kB  9.59%  regexp/syntax.(*compiler).inst (inline)
  516.01kB  9.08% 63.93%   516.01kB  9.08%  regexp/syntax.appendRange
  512.20kB  9.02% 72.95%   512.20kB  9.02%  runtime.malg
  512.16kB  9.02% 81.97%   512.16kB  9.02%  github.com/hashicorp/terraform-provider-vault/vault.approleAuthBackendLoginResource
  512.16kB  9.02% 90.98%   512.16kB  9.02%  github.com/hashicorp/terraform-provider-vault/vault.getCommonManagedKeysSchema
  512.16kB  9.02%   100%   512.16kB  9.02%  github.com/hashicorp/terraform-provider-vault/vault.identityOIDCOpenIDConfigDataSource

After:

(pprof) top
Showing nodes accounting for 8218.13kB, 84.25% of 9754.22kB total
Showing top 10 nodes out of 65
      flat  flat%   sum%        cum   cum%
 3075.29kB 31.53% 31.53%  3075.29kB 31.53%  github.com/aws/aws-sdk-go/aws/endpoints.init
 1028.88kB 10.55% 42.08%  1028.88kB 10.55%  regexp.onePassCopy
  520.04kB  5.33% 47.41%   520.04kB  5.33%  go.opencensus.io/stats/view.NewMeter
  517.33kB  5.30% 52.71%   517.33kB  5.30%  regexp/syntax.(*compiler).inst
  514.63kB  5.28% 57.99%   514.63kB  5.28%  cloud.google.com/go/monitoring/apiv3/v2/monitoringpb.init
  512.88kB  5.26% 63.24%   512.88kB  5.26%  unicode.map.init.0
  512.56kB  5.25% 68.50%   512.56kB  5.25%  encoding/json.typeFields
  512.20kB  5.25% 73.75%   512.20kB  5.25%  runtime.malg
  512.16kB  5.25% 79.00%   512.16kB  5.25%  github.com/hashicorp/terraform-provider-vault/vault.azureAuthBackendRoleResource
  512.16kB  5.25% 84.25%   512.16kB  5.25%  github.com/hashicorp/terraform-provider-vault/vault.tokenResource

Relates OR Closes #0000

Checklist

fairclothjm commented 7 months ago

Once https://github.com/hashicorp/vault/pull/25499 is merged and the API is tagged we can pull that in

pree commented 6 months ago

Just encountered an issue after upgrading the vault provider that my data resource for vault_auth_backend now errors with * permission denied.

I had to extend the policy my policy with sudo now for this to work:

path "sys/auth*" {
  capabilities = ["read", "list", "sudo"]
}

This seems like overkill, especially as it's not documented in the vault API docs. Also, the lack of a breaking change documentation isn't nice either ...

fairclothjm commented 6 months ago

@pree Hello, I am sorry you are having trouble! Thanks for pointing this out. We did mention policy changes in the Changelog and we have published a v4.0.0 Upgrade Guide but it looks like we missed this nuance.

I think we may be able get a patch in to remove the sudo requirement, however you will still need to update your policy path to use sys/mounts/auth/*.

pree commented 6 months ago

Thanks for the reply @fairclothjm. I would love a # Breaking Change header in the Changelog to make this more clear. Removing the sudo requirement would be the right approach imho.

kkronenb commented 6 months ago

I have started to play around with this version and so far, it is a night and day difference in our environment. Significantly less CPU (100%->40%) and roughly 3x faster to run though all of our resources.

Thank you so much and kudos @fairclothjm

fairclothjm commented 6 months ago

Thanks and glad to hear that @kkronenb !