hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Support ACL cache pre-warming #1419

Closed: armon closed this issue 8 years ago

armon commented 8 years ago

Currently, ACL tokens are fetched on demand from the ACL datacenter and then cached in an LRU cache. This is used along with "acl_down_policy=extend-cache" to allow operation when the ACL datacenter is offline but the last known policy is cached.
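For readers unfamiliar with the extend-cache behavior, here is a minimal Go sketch of the idea, not Consul's actual implementation. The names `Policy`, `Cache`, `NewCache`, `Lookup`, and `ErrACLDown` are all hypothetical, and the cache is a plain map with TTLs rather than a real LRU.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Policy is a hypothetical stand-in for a compiled ACL policy.
type Policy struct{ Rules string }

// ErrACLDown is a hypothetical error for an unreachable ACL datacenter.
var ErrACLDown = errors.New("ACL datacenter unreachable")

type entry struct {
	policy  Policy
	expires time.Time
}

// Cache is a simplified token cache: a map with TTLs and no LRU eviction.
type Cache struct {
	entries    map[string]entry
	ttl        time.Duration
	downPolicy string // only "extend-cache" is treated specially in this sketch
	resolve    func(token string) (Policy, error)
}

// NewCache builds a cache around a resolver that talks to the ACL datacenter.
func NewCache(ttl time.Duration, downPolicy string, resolve func(string) (Policy, error)) *Cache {
	return &Cache{entries: make(map[string]entry), ttl: ttl, downPolicy: downPolicy, resolve: resolve}
}

// Lookup returns a fresh cached policy, otherwise resolves it remotely.
// If resolution fails and the down policy is "extend-cache", an expired
// cached policy is still served; a token never seen before still fails.
func (c *Cache) Lookup(token string) (Policy, error) {
	now := time.Now()
	e, cached := c.entries[token]
	if cached && now.Before(e.expires) {
		return e.policy, nil
	}
	p, err := c.resolve(token)
	if err == nil {
		c.entries[token] = entry{policy: p, expires: now.Add(c.ttl)}
		return p, nil
	}
	if cached && c.downPolicy == "extend-cache" {
		return e.policy, nil
	}
	return Policy{}, err
}

func main() {
	up := true
	// A TTL of 0 means every cached entry counts as expired right away.
	c := NewCache(0, "extend-cache", func(token string) (Policy, error) {
		if !up {
			return Policy{}, ErrACLDown
		}
		return Policy{Rules: "read"}, nil
	})
	_, _ = c.Lookup("t1") // populate the cache while the ACL DC is reachable
	up = false            // simulate losing the ACL datacenter

	p, err := c.Lookup("t1")
	fmt.Println(p.Rules, err) // the expired entry is still served
	_, err = c.Lookup("t2")
	fmt.Println(err) // a token that was never cached cannot be resolved
}
```

The second lookup of `t2` shows the gap that pre-warming is meant to close: extend-cache can only help for tokens that were already fetched at least once.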

Support for pre-warming the ACL cache would ensure that most (if not all) ACLs are cached, making the extended cache behavior even more resilient to the loss of the ACL datacenter. It would also reduce the performance impact of cache misses by making them much less likely.

slackpad commented 8 years ago

@sean- and I were thinking more about this, and we can probably eliminate all ACL caching in favor of this if it's full ACL replication. We'd create a new endpoint that supports blocking queries and provides a complete snapshot of ACLs (the SHA256 hash of each token, plus its policy), or we could make a fancier API that can intelligently send updates after an initial full dump. Any server with the replicated ACLs can hash an incoming token and look up the policy, but won't have any other access to the actual tokens. If we do full replication, we can eliminate the ACL cache code completely, which would massively simplify that code path.
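A rough Go sketch of the hashed-snapshot idea described above, under stated assumptions: `Policy`, `BuildSnapshot`, `Replica`, and `hashToken` are hypothetical names, the hash is hex-encoded SHA-256 of the raw token, and the blocking-query transport is left out entirely.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Policy is a hypothetical stand-in for an ACL policy.
type Policy struct{ Rules string }

// hashToken returns the hex-encoded SHA-256 digest of a token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// BuildSnapshot is what the ACL datacenter would run: it knows the real
// tokens, but only their hashes leave the datacenter.
func BuildSnapshot(tokens map[string]Policy) map[string]Policy {
	snap := make(map[string]Policy, len(tokens))
	for token, policy := range tokens {
		snap[hashToken(token)] = policy
	}
	return snap
}

// Replica is a remote server's copy of the snapshot, keyed by token hash.
type Replica struct{ byHash map[string]Policy }

// NewReplica wraps a received snapshot.
func NewReplica(snapshot map[string]Policy) *Replica {
	return &Replica{byHash: snapshot}
}

// Lookup hashes the incoming token and finds its policy, if replicated.
func (r *Replica) Lookup(token string) (Policy, bool) {
	p, ok := r.byHash[hashToken(token)]
	return p, ok
}

func main() {
	snap := BuildSnapshot(map[string]Policy{"s3cret-token": {Rules: "write"}})
	replica := NewReplica(snap)

	p, ok := replica.Lookup("s3cret-token")
	fmt.Println(p.Rules, ok)

	// The replica never stores the raw token itself.
	_, leaked := snap["s3cret-token"]
	fmt.Println(leaked)
}
```

Because the replica only holds digests, a server can answer "what policy does this presented token have?" without being able to enumerate the tokens.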

We can still support the old ACL endpoint to fetch a policy for a token, so for upgrades you just need to upgrade your ACL DC to Consul 0.7 before you upgrade any of your other servers.

armon commented 8 years ago

@slackpad I don't think we want to do that. In the case where you have tons of ACL tokens, you probably don't want to pre-load them all, and the cache is a pretty huge performance win. Given that the logic is already there, I don't see a compelling reason to remove it.

slackpad commented 8 years ago

@armon are you thinking we'd still replicate the full un-compiled policy set from the servers but retain the cache of compiled policies (so you could still look up any ACL in the event of a partition)?

armon commented 8 years ago

@slackpad I'm not exactly sure what you mean, but I meant this would act as a second cache tier, and would not affect the existing caches.

slackpad commented 8 years ago

@armon I think the main thing I'm stuck on is what subset of ACLs would you use to warm the cache?

armon commented 8 years ago

@slackpad I think there are basically two different caches. Cache 1 is the existing LRU/2Q of fixed size. It's keyed on the actual token and cannot be disabled. Cache 2 would be new, keyed on the hash of the ID, and would contain the full ACL set.

The idea is ACLLookup(token) -> Cache1(token) || Cache2(hash(token)) || ACLResolve(token)
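The lookup chain above could be sketched in Go roughly as follows. This is an illustration under assumptions, not the eventual Consul code: `Lookuper`, its fields, and `ErrNotFound` are hypothetical, plain maps stand in for the LRU/2Q and the replicated set, and copying hits from the later tiers back into Cache 1 is an assumed design choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for a compiled ACL policy.
type Policy struct{ Rules string }

// ErrNotFound is a hypothetical error returned by the remote resolver.
var ErrNotFound = errors.New("ACL not found")

// hashToken returns the hex-encoded SHA-256 digest of a token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// Lookuper chains the two cache tiers and the remote resolver.
type Lookuper struct {
	cache1  map[string]Policy // keyed by raw token (stand-in for the LRU/2Q)
	cache2  map[string]Policy // keyed by token hash (the full replicated set)
	resolve func(token string) (Policy, error)
}

// ACLLookup tries Cache1(token), then Cache2(hash(token)), then the remote
// ACLResolve(token). It also reports which tier answered.
func (l *Lookuper) ACLLookup(token string) (Policy, string, error) {
	if p, ok := l.cache1[token]; ok {
		return p, "cache1", nil // hot path: no hashing needed
	}
	if p, ok := l.cache2[hashToken(token)]; ok {
		l.cache1[token] = p // assumed: promote so later hits skip the hash
		return p, "cache2", nil
	}
	p, err := l.resolve(token)
	if err != nil {
		return Policy{}, "", err
	}
	l.cache1[token] = p
	return p, "resolve", nil
}

func main() {
	l := &Lookuper{
		cache1: map[string]Policy{},
		cache2: map[string]Policy{hashToken("replicated"): {Rules: "read"}},
		resolve: func(token string) (Policy, error) {
			if token == "remote-only" {
				return Policy{Rules: "write"}, nil
			}
			return Policy{}, ErrNotFound
		},
	}
	for _, tok := range []string{"replicated", "replicated", "remote-only", "bogus"} {
		p, tier, err := l.ACLLookup(tok)
		fmt.Println(tok, p.Rules, tier, err)
	}
}
```

The first lookup of a replicated token pays for one hash and is answered by Cache 2; repeated lookups are then served from Cache 1 without hashing.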

slackpad commented 8 years ago

OK, I see now that you'd save the token hash via Cache1, and given how critical this is in the request path, that makes sense.

sean- commented 8 years ago

FYI: now that #1873 has been committed (as a side effect, clients pre-warm the ACL caches of all servers), some of the need for pre-warming ACL caches has gone away. This is not a mirror of the ACLs, so in the event of a partition between a remote DC and the authoritative ACL DC, it's possible that unexercised ACLs would result in a cache miss, but this does cover the common case.