hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Blocking Queries on ACL resources and Config Entries Unblock Too Often #6530

Open mkeeler opened 5 years ago

mkeeler commented 5 years ago

Overview of the Issue

Internally our blocking query function uses the index reported in the QueryMeta struct to determine if the index is new enough to unblock and return the data back to the API caller.

All of the blocking queries issued within agent/consul/acl_endpoint.go and agent/consul/config_endpoint.go set this index to whatever value comes out of the "index" table for that type in memdb. This is done instead of setting it to the max modify index of the data (for listings) or simply the modify index (for single object reads).

As a result, we may unblock a query even when the underlying data it is looking at hasn't changed but other, unrelated data has.
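To illustrate the difference, here is a minimal, self-contained sketch in Go. It is not Consul's actual code: the `store` and `entry` types and the `indexFromTable`, `indexFromEntry`, and `shouldUnblock` functions are hypothetical names made up for this example, and the real state store is backed by memdb rather than a map. The fallback to the table index for missing entries is also an assumption.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a config entry or ACL object.
type entry struct {
	Name        string
	ModifyIndex uint64
}

// store is a toy replacement for one memdb table plus its row in the "index" table.
type store struct {
	tableIndex uint64 // index of the last write to ANY entry in this table
	entries    map[string]entry
}

func newStore() *store {
	return &store{entries: make(map[string]entry)}
}

// set writes an entry at the given index and bumps the table-wide index.
func (s *store) set(name string, idx uint64) {
	s.entries[name] = entry{Name: name, ModifyIndex: idx}
	s.tableIndex = idx
}

// indexFromTable reports the table-wide index, which is the behavior
// described in this issue.
func (s *store) indexFromTable(name string) uint64 {
	return s.tableIndex
}

// indexFromEntry reports the entry's own modify index. For a missing entry
// it falls back to the table index (an assumption made for this sketch).
func (s *store) indexFromEntry(name string) uint64 {
	if e, ok := s.entries[name]; ok {
		return e.ModifyIndex
	}
	return s.tableIndex
}

// shouldUnblock mimics the blocking query check: return to the caller only
// if the reported index is newer than the index the caller already saw.
func shouldUnblock(reported, minQueryIndex uint64) bool {
	return reported > minQueryIndex
}

func main() {
	s := newStore()
	s.set("web", 10) // the entry being watched
	s.set("db", 11)  // an unrelated write to the same table

	// The caller already saw "web" at index 10 and is blocking on it.
	fmt.Println(shouldUnblock(s.indexFromTable("web"), 10)) // true: spurious wakeup
	fmt.Println(shouldUnblock(s.indexFromEntry("web"), 10)) // false: keeps blocking
}
```

With the table-wide index, the unrelated write to "db" wakes up the watcher on "web". With the per-entry modify index, the watcher keeps blocking until "web" itself changes.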

In practice, our watches on config entries are likely to cause unnecessary churn in proxy configurations for Connect. There are likely other real-world performance issues too.

mkeeler commented 5 years ago

One thing to note is that if the blocking query is started before the table's index is updated, then we may block waiting for the data to be updated. However, when we are blocking waiting for a non-existent entry to be created, this will not work, as we will be woken up for every modification.

pierresouchay commented 5 years ago

@mkeeler For the non-existent entries, you could use the same mechanism as the optimization in https://github.com/hashicorp/consul/pull/4810. It is not perfect if many services disappear per second, but overall it is still very good, since "normal" updates that do not delete a full service do not wake up watchers all the time. In our case, it considerably decreased the load on Consul servers caused by apps waiting for nonexistent services (e.g. Prometheus).