Performance / Scaling?

shaunco commented 3 years ago

Describe the Change You Would Like

Details on performance or scaling for large deployments would be a great addition to the docs. For example, given something like the Deployment of Vault in three Availability Zones (OSS) reference design:

What is the maximum number of backends/endpoints (given that each creates a goroutine)?
What is the suggested additional memory/CPU needed per backend/endpoint (goroutines cost 4 to 4.5 KB of memory each)?
Does this plugin spin up 1 refresher per backend per Vault server instance?
- If so, does that mean a backend with a 60-second refresh interval is actually being refreshed 3 times per 60 seconds (once on each Vault instance)?
What is the suggested maximum number of tokens per backend/endpoint?
Possible paths for moving beyond the suggestions above?
Performance/scaling tips for new custom providers (outside of OIDC/OAuth2 Custom)
- For example, why does the Microsoft Azure AD provider need a new backend per AAD tenant instead of just having a single backend that swaps the tenant in the URL for each token at request/refresh/access time? This decision means that each AAD tenant backend/endpoint gets its own scheduler goroutine, instead of a single Microsoft AAD backend that can handle multiple tenants.

Totally unrelated to this request

Fantastic job on this Vault plugin! Looking forward to contributing.

DrDaveD commented 3 years ago

I don't know the answers to most of your questions, but regarding the refresher, check out the recently merged but not yet released PR #43.

shaunco commented 3 years ago

I don't know the answers to most of your questions, but regarding the refresher, check out the recently merged but not yet released PR #43.

Looks like a good change!

Would love some insight into why the Microsoft AAD tenant is an endpoint config option instead of a "Credential exchange option" (many goroutines refreshing for AAD vs. 1 goroutine refreshing for AAD).

impl commented 3 years ago

Hey @shaunco,

Thanks for these good questions. I think this probably needs to be documented somewhere for posterity, but I'll give you some general thoughts first.

What is the maximum number of backends/endpoints (given that each creates a goroutine)?

External plugins (those not built into the Vault binary), including this one, run as separate OS processes using HashiCorp's go-plugin system. Processes are significantly heavier than Goroutines. You can realistically run tens or even hundreds of thousands of Goroutines in a single process without encountering problems, even on crappy hardware. But you're going to hit OS limitations when running thousands of processes.

A good area where a lot of research has been done into this is PostgreSQL. PostgreSQL forks a process per connection. When you start to hit OS limits, you're forced to use a connection pooler like PgBouncer (not a bad solution, just a required one for that model). Vault doesn't have any sort of pooling mechanism for plugins -- and I don't think such a thing would really make sense -- so just keep in mind that you have to deal with that process overhead for each backend.
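To illustrate the Goroutine half of that comparison, here is a minimal sketch that keeps 100,000 Goroutines alive at once in a single process. The function name `spawnAndCount` is made up for this example and has nothing to do with the plugin's code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawnAndCount starts n goroutines, parks all of them on a channel so they
// are alive at the same time, then releases them and returns how many ran.
func spawnAndCount(n int) int64 {
	var wg sync.WaitGroup
	var count int64
	release := make(chan struct{})

	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-release // block until every goroutine has been started
			atomic.AddInt64(&count, 1)
		}()
	}

	close(release) // wake all parked goroutines
	wg.Wait()
	return count
}

func main() {
	fmt.Println(spawnAndCount(100000)) // prints 100000
}
```

Doing the same thing with 100,000 OS processes would run into process-table and memory limits on most machines, which is the concern with many mounted external plugins.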

What is the suggested additional memory/CPU needed per backend/endpoint (goroutines cost 4 to 4.5 KB of memory each)?

Again, this is a place I'd be much more concerned about processes than Goroutines. Up to a point, probably anywhere less than 100 mounted external plugins, and of course depending on what they're doing exactly, I probably wouldn't really worry about additional memory/CPU constraints. If you're getting above that, some benchmarking may be in order. I haven't done any myself to date as we're only using a handful of these plugins mounted at the same time in our systems.

Does this plugin spin up 1 refresher per backend per Vault server instance?

When you run Vault in HA, only one server works with the data store at a time, so there is only going to be one instance of the plugin running per engine mount at a time. The other servers are in standby. Vault does not support any sort of multi-master configuration as far as I know.

There should not be any race condition with refreshing credentials. Internally, the plugin acquires a lock on a subset of the keyspace when it refreshes. If you find a race condition where a credential is being refreshed multiple times, it's a bug!
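As a rough illustration of locking a subset of the keyspace, here is a minimal sketch using a fixed set of shards, each with its own mutex. This is not the plugin's actual implementation; the `shard` and `refresher` types, the shard count of 64, and the key `creds/alice` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shard owns one slice of the keyspace and the lock that protects it.
type shard struct {
	mu    sync.Mutex
	calls map[string]int // upstream refreshes performed per key
}

// refresher is a hypothetical stand-in for the plugin's refresh logic.
type refresher struct{ shards [64]shard }

func newRefresher() *refresher {
	r := &refresher{}
	for i := range r.shards {
		r.shards[i].calls = make(map[string]int)
	}
	return r
}

// shardFor hashes the key so that the same key always maps to the same lock.
func (r *refresher) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &r.shards[h.Sum32()%uint32(len(r.shards))]
}

// refresh re-checks state after taking the lock, so callers that were
// waiting on the same key do not repeat the upstream request.
func (r *refresher) refresh(key string) {
	s := r.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.calls[key] > 0 {
		return // someone else refreshed this key while we waited
	}
	s.calls[key]++ // stand-in for the real upstream refresh request
}

// concurrentRefreshes fires the given number of simultaneous refreshes for
// one key and reports how many upstream requests were actually made.
func concurrentRefreshes(callers int) int {
	r := newRefresher()
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.refresh("creds/alice")
		}()
	}
	wg.Wait()
	s := r.shardFor("creds/alice")
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.calls["creds/alice"]
}

func main() {
	fmt.Println(concurrentRefreshes(50)) // prints 1
}
```

Keys that hash to different shards never contend with each other, so a slow refresh of one credential only blocks the credentials that share its shard.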

What is the suggested maximum number of tokens per backend/endpoint?

The refreshing mechanism, if you choose to use it, works by enumerating every token that is currently stored. This is generally a fast operation even for large amounts of data, but it's really dependent upon the storage backend you choose and how you've tuned it. For example, if you use GCS instead of Consul, network latency may be a more significant factor for you. You'll see this in the KV secrets engine, too, though.

Another factor is that the refresher is part of a pool of 16 Goroutines that do the refresh work against the OAuth API in parallel. These Goroutines will apply backpressure to the enumeration operation, so two factors to consider here are:

The duration of your access tokens (i.e., the frequency at which things will need to be refreshed). If all of your access tokens expire after 1 minute, every one will need to be refreshed every time the refresher runs. If they expire after 24 hours, you'll likely be fine even with a huge number of tokens.
Your API's latency. If your OAuth API is quite slow, this can cause the refresher to also slow down.
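The backpressure described above can be sketched with a fixed pool of workers reading from an unbuffered channel: the enumerating side blocks on each send until a worker is free. This is only an illustration under that assumption, not the plugin's code; `refreshAll` and `runPool` are hypothetical names, and the 1 ms sleep stands in for the OAuth API's latency.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

const workers = 16 // pool size mentioned above

// refreshAll enumerates keys into an unbuffered channel. Each send blocks
// until one of the workers is idle, which slows the enumeration down to the
// pace of the upstream API.
func refreshAll(keys []string, refresh func(string)) {
	work := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for k := range work {
				refresh(k)
			}
		}()
	}
	for _, k := range keys {
		work <- k
	}
	close(work)
	wg.Wait()
}

// runPool refreshes n fake credentials and reports how many completed and
// the highest number of refreshes that were in flight at the same time.
func runPool(n int) (done, peak int64) {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("creds/%d", i)
	}
	var inFlight int64
	refreshAll(keys, func(string) {
		cur := atomic.AddInt64(&inFlight, 1)
		for {
			p := atomic.LoadInt64(&peak)
			if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
				break
			}
		}
		time.Sleep(time.Millisecond) // simulated OAuth API latency
		atomic.AddInt64(&inFlight, -1)
		atomic.AddInt64(&done, 1)
	})
	return done, peak
}

func main() {
	done, peak := runPool(200)
	fmt.Println(done, peak <= workers) // prints: 200 true
}
```

With this shape, the time for one refresher run is roughly the number of tokens that need refreshing, times the API latency, divided by 16, which is why short-lived tokens and a slow API compound each other.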

Sorry I don't have a hard number for you here. I would actually like to do some benchmarks on this at some point. Might be something fun to add to the test suite here.

Possible paths for moving beyond the suggestions above?

I have a couple of thoughts here that I think I've briefly discussed before with @DrDaveD, but happy to take more feedback.

One of the big issues we encounter is natural lapsing of tokens, e.g. from users disconnecting their upstream accounts. So to add to our existing tuning options I'd propose the following:

tune_auto_delete_after_expired_seconds: The refresher will automatically permanently remove tokens that have been expired for a certain amount of time and can no longer be refreshed for any reason.
tune_auto_delete_after_failed_attempts: The refresher will automatically permanently remove tokens that could not be refreshed because the upstream API rejected the refresh after some number of attempts. Maybe broken out by HTTP status code or specific OAuth error? Not sure.
tune_refresh_expiry_delta: Right now this is set to 10 seconds and can't be overridden in the refresher. We could probably make it longer to batch up more tokens at once to go with a longer interval between refresh runs.
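To make the proposal a bit more concrete, here is a minimal sketch of how the refresher might combine the three options when it looks at a token. None of this exists yet: the `tuning` and `token` types, the `decide` function, and the rule that a zero value disables an option are all assumptions, and it ignores the "can no longer be refreshed" condition for simplicity.

```go
package main

import (
	"fmt"
	"time"
)

// tuning mirrors the proposed options. Field names are hypothetical.
type tuning struct {
	AutoDeleteAfterExpired        time.Duration // tune_auto_delete_after_expired_seconds
	AutoDeleteAfterFailedAttempts int           // tune_auto_delete_after_failed_attempts
	RefreshExpiryDelta            time.Duration // tune_refresh_expiry_delta
}

// token holds only the fields this sketch needs.
type token struct {
	Expiry         time.Time
	FailedAttempts int
}

type action int

const (
	keep action = iota
	refresh
	remove
)

func (a action) String() string {
	return [...]string{"keep", "refresh", "remove"}[a]
}

// decide picks what the refresher would do with a token at time now.
// A zero value for either auto-delete option is treated as "disabled".
func decide(t token, now time.Time, cfg tuning) action {
	switch {
	case cfg.AutoDeleteAfterFailedAttempts > 0 &&
		t.FailedAttempts >= cfg.AutoDeleteAfterFailedAttempts:
		return remove
	case cfg.AutoDeleteAfterExpired > 0 &&
		now.Sub(t.Expiry) >= cfg.AutoDeleteAfterExpired:
		return remove
	case t.Expiry.Sub(now) <= cfg.RefreshExpiryDelta:
		return refresh // expires within the delta, or already expired
	default:
		return keep
	}
}

// examples runs decide on four representative tokens.
func examples() [4]action {
	now := time.Now()
	cfg := tuning{
		AutoDeleteAfterExpired:        time.Hour,
		AutoDeleteAfterFailedAttempts: 3,
		RefreshExpiryDelta:            10 * time.Second,
	}
	return [4]action{
		decide(token{Expiry: now.Add(5 * time.Second)}, now, cfg),
		decide(token{Expiry: now.Add(time.Hour)}, now, cfg),
		decide(token{Expiry: now.Add(-2 * time.Hour)}, now, cfg),
		decide(token{Expiry: now.Add(time.Hour), FailedAttempts: 3}, now, cfg),
	}
}

func main() {
	fmt.Println(examples()) // prints [refresh keep remove remove]
}
```

A larger `RefreshExpiryDelta` moves more tokens into the refresh case on each run, which is what would let it pair with a longer interval between runs.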

Any other options you'd like to see? Happy to add them.

I'm also looking at adding support for multiple providers per plugin instance. I think this should be pretty easy to do now that our code is more rigorously structured. When you write a credential, you'd just choose which provider you want to use for it. Would this ease some of your concerns about running multiple engine mounts of this plugin?

For example, why does the Microsoft Azure AD provider need a new backend per AAD tenant instead of just having a single backend that swaps the tenant in the URL for each token at request/refresh/access time?

Because I didn't understand how Azure AD worked when I wrote the provider (we don't use Azure here). :) #52 fixes this, so you can specify the tenant per credential if you like.
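For reference, the idea of swapping the tenant into the URL per credential can be sketched as below. This is only an illustration and not necessarily how #52 implements it; `tokenURL` is a hypothetical helper, the URL shape is the Microsoft identity platform v2.0 token endpoint, and falling back to `common` when no tenant is given is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// tokenURL builds the token endpoint for one credential's tenant, so a
// single backend can serve credentials from many tenants.
func tokenURL(tenant string) string {
	if tenant == "" {
		tenant = "common" // assumed fallback when no tenant is specified
	}
	return "https://login.microsoftonline.com/" +
		url.PathEscape(tenant) + "/oauth2/v2.0/token"
}

func main() {
	fmt.Println(tokenURL("contoso.onmicrosoft.com"))
	// prints https://login.microsoftonline.com/contoso.onmicrosoft.com/oauth2/v2.0/token
}
```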

shaunco commented 3 years ago

@impl - Thanks for the awesomely detailed answer!

On the goroutines vs processes, it was my understanding that Vault only executes one instance of each plugin (by SHA/path) regardless of the number of mount points, rather than one instance of each plugin for each mount point. It is likely that I am totally wrong on that, and will do some further digging through Vault code, as the Vault docs are not explicit one way or the other.

On #52, @vavsab is on our team at @mapped. We'll continue to contribute where we can and really appreciate you and the team working with us to accept these PRs!

impl commented 3 years ago

You can check out #56 for where my head is at so far with the performance tuning options. Would love any feedback from y'all on it.

puppetlabs / vault-plugin-secrets-oauthapp

Performance / Scaling? #49
