Closed by armon 8 years ago
@sean- and I were thinking more about this, and if we do full ACL replication we can probably eliminate all ACL caching in favor of it. We'd create a new endpoint that supports blocking queries and provides a complete snapshot of ACLs (SHA-256 hash of the token, plus the policy), or we could make a fancier API that can intelligently send updates after an initial full dump. Any server with the replicated ACLs can hash an incoming token and look up the policy, but won't have any other access to the actual tokens. Full replication would let us remove the ACL cache code completely, which would massively simplify that code path.
We can still support the old ACL endpoint to fetch a policy for a token, so for upgrades you just need to upgrade your ACL DC to Consul 0.7 before you upgrade any of your other servers.
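To make the proposal above concrete, here is a minimal sketch of a replicated ACL table keyed by the SHA-256 hash of each token. The names `hash_token`, `ReplicatedACLs`, `apply_snapshot`, and `lookup` are hypothetical and are not part of Consul. The sketch only illustrates the idea that a server can resolve a policy from a hashed token without ever storing the token itself.

```python
import hashlib
from typing import Dict, Optional


def hash_token(token: str) -> str:
    """Return the hex-encoded SHA-256 digest of a token ID."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class ReplicatedACLs:
    """Hypothetical replicated snapshot mapping token hash -> policy rules.

    Raw tokens are never stored here, only their hashes.
    """

    def __init__(self) -> None:
        self._policies: Dict[str, str] = {}

    def apply_snapshot(self, snapshot: Dict[str, str]) -> None:
        """Replace local state with a full dump keyed by token hash."""
        self._policies = dict(snapshot)

    def lookup(self, token: str) -> Optional[str]:
        """Hash the incoming token and return its policy, if replicated."""
        return self._policies.get(hash_token(token))
```

A server holding such a snapshot could answer `lookup(token)` locally. An incremental-update API would replace `apply_snapshot` with per-entry upserts and deletes, which this sketch leaves out.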
@slackpad I don't think we want to do that. In the case where you have tons of ACL tokens, you probably don't want to pre-load them all, and the cache is a pretty huge performance win. Given that the logic is already there, I don't see a compelling reason to remove it.
@armon are you thinking we'd still replicate the full un-compiled policy set from the servers but retain the cache of compiled policies (so you could still look up any ACL in the event of a partition)?
@slackpad I'm not exactly sure what you mean, but I meant this would act as a second cache tier, and would not affect the existing caches.
@armon I think the main thing I'm stuck on is what subset of ACLs would you use to warm the cache?
@slackpad I think there are basically two different caches. Cache 1 is the existing LRU/2Q of fixed size. It's keyed on the actual token and cannot be disabled. Cache 2 would be new, keyed on the hash of the ID, and would contain the full ACL set.
The idea is ACLLookup(token) -> Cache1(token) || Cache2(hash(token)) || ACLResolve(token)
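The lookup chain above could be sketched roughly as follows. This is an illustration under assumptions, not Consul's code: `LRUCache` is a tiny stand-in for the existing LRU/2Q cache, `cache2` is any mapping from token hash to policy, and `acl_resolve` is a hypothetical callable that asks the ACL datacenter. Writing the result back into Cache 1 after a Cache 2 or remote hit is also an assumption.

```python
import hashlib
from collections import OrderedDict


class LRUCache:
    """Small fixed-size LRU standing in for the existing Cache 1."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def acl_lookup(token, cache1, cache2, acl_resolve):
    """Cache1(token) || Cache2(hash(token)) || ACLResolve(token)."""
    policy = cache1.get(token)
    if policy is not None:
        return policy
    # Cache 2 is keyed on the token's hash, never on the token itself.
    token_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
    policy = cache2.get(token_hash)
    if policy is None:
        policy = acl_resolve(token)  # last resort: ask the ACL datacenter
    if policy is not None:
        cache1.put(token, policy)  # assumption: promote hits into Cache 1
    return policy
```

With this ordering the hot path stays a plain Cache 1 hit, and the hash is only computed on a Cache 1 miss.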
Ok I see now that you'd save the token hash via Cache1, and given how critical this is in the request path that makes sense.
FYI: now that #1873 has been committed (a side effect of which is that clients prewarm the ACL caches of all servers), some of the need for pre-warming ACL caches has gone away. This is not a mirror of the ACLs, so in the event of a partition between a remote DC and the authoritative ACL DC, it's possible that unexercised ACLs would result in a cache miss, but this does round out the common case.
Currently the ACL tokens are fetched on demand from the ACL datacenter and then cached in an LRU. This is used along with "acl_down_policy=extend-cache" to allow operation when the ACL datacenter is offline but the last known policy is cached.
Support for pre-warming the ACL cache ensures that most (if not all) ACLs are cached, making the extended cache behavior even more resilient to the loss of the ACL datacenter. It also reduces the performance impact of cache misses by making them much less likely.
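As a rough illustration of the behavior described above, the sketch below models on-demand fetching with a TTL, an "extend-cache" down policy that serves expired entries when the ACL datacenter is unreachable, and a pre-warm step. All names (`OnDemandACLCache`, `ACLDatacenterUnavailable`, `prewarm`, the `"deny-all"` and `"allow-all"` placeholders) are hypothetical, and LRU eviction is omitted for brevity. It is not how Consul implements this.

```python
import time


class ACLDatacenterUnavailable(Exception):
    """Raised by the (hypothetical) fetch function when the ACL DC is down."""


class OnDemandACLCache:
    def __init__(self, fetch, ttl, down_policy="extend-cache",
                 clock=time.monotonic):
        self._fetch = fetch            # callable: token -> policy
        self._ttl = ttl                # seconds an entry stays fresh
        self._down_policy = down_policy
        self._clock = clock
        self._cache = {}               # token -> (policy, expires_at)

    def resolve(self, token):
        now = self._clock()
        entry = self._cache.get(token)
        if entry is not None and entry[1] > now:
            return entry[0]            # fresh cache hit
        try:
            policy = self._fetch(token)
        except ACLDatacenterUnavailable:
            # extend-cache: keep using the last known policy past its TTL.
            if self._down_policy == "extend-cache" and entry is not None:
                return entry[0]
            if self._down_policy == "allow":
                return "allow-all"     # placeholder result
            return "deny-all"          # placeholder: nothing cached, or deny
        self._cache[token] = (policy, now + self._ttl)
        return policy

    def prewarm(self, tokens):
        """Resolve tokens ahead of time so they survive a later outage."""
        for token in tokens:
            self.resolve(token)
```

In this model a token that was pre-warmed keeps resolving during an outage, while a token that was never exercised falls through to the deny placeholder. That is the gap pre-warming is meant to narrow.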