hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0

When watching all services, consul-template is DOSing the Consul agent #1065


pierresouchay commented 6 years ago

Consul Template version

consul-template v0.19.4 (68b1da2)

Expected behavior

Consul-template should rate-limit its calls to avoid DoSing the Consul servers.

Actual behavior

When using consul-template to watch all services on a large cluster (1k+ services, more than 4,500 nodes, with Consul indexes changing almost constantly, especially on /catalog/services), consul-template very quickly DoSes the Consul servers: it re-requests data from its local agent far too fast, and the local agent forwards those requests straight to the Consul servers. In one of our configurations, we then see more than 500 Mb/s of traffic between the Consul agent and the Consul servers.

slackpad commented 6 years ago

Hi @pierresouchay we recommend deduplication mode if you have many instances all rendering the same template. Also, there's an issue with the extra writes to the catalog that's about to get a fix in https://github.com/hashicorp/consul/pull/3845.
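For reference, deduplication mode is enabled with a `deduplicate` block in the consul-template configuration; one elected instance renders the data and stores the compressed result under a KV prefix that the other instances watch (the prefix shown is the documented default):

```hcl
deduplicate {
  enabled = true
  # KV prefix where the elected leader stores rendered template data.
  prefix  = "consul-template/dedup/"
}
```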

pierresouchay commented 6 years ago

Hello @slackpad ,

We already applied those patches https://github.com/hashicorp/consul/pull/3845 and https://github.com/hashicorp/consul/pull/3642

Basically, when watching many services (for a load balancer), consul-template consumes more than 400 Mbit/s downloading data from its local agent, and the local Consul agent continuously consumes around 500 Mbit/s talking to its Consul servers.

We see this when watching /catalog/services and iterating over the results (even when filtering results by tags or other parameters). I suspect that many services are queried continuously; this is especially visible when consul-template uses its local agent, since the latency between the two is close to zero.
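For illustration, a template of the following shape (the tag name and output format are just examples) watches /catalog/services and then issues one additional blocking query per matching service, which is what multiplies the number of watches:

```
{{ range services }}{{ if .Tags | contains "http" }}
# Backends for {{ .Name }}
{{ range service .Name }}server {{ .Address }}:{{ .Port }}
{{ end }}{{ end }}{{ end }}
```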

I am preparing a patch to add a delay after each successful query, but I think we are not the only ones having this issue. While deduplication mode might help, it is unfortunately constrained by the KV size limits and looks to me like a poor workaround.

I am pretty sure operators of large clusters would like a way to enforce a minimum delay before re-watching a service that has just been modified. This is especially true if you want to wait at least one second before regenerating your file anyway.

I think we should keep this issue open, don't you think?

Regards

slackpad commented 6 years ago

I'll reopen this to take a deeper look.

> Basically, when watching many services (for a load balancer), consul-template consumes more than 400 Mbit/s downloading data from its local agent, and the local Consul agent continuously consumes around 500 Mbit/s talking to its Consul servers.

That does seem like a huge amount of churn. Have you looked into consul-template's quiescence timers at all? They are designed to give you a hold-off period to rate limit updates.
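For reference, the quiescence timers are configured with `wait` blocks, either globally or per template (the paths and values below are just an example):

```hcl
# Global quiescence: only render once inputs have been stable for at
# least `min`, but never delay a render longer than `max`.
wait {
  min = "5s"
  max = "10s"
}

template {
  source      = "services.ctmpl"   # example paths
  destination = "services.conf"

  # Per-template override of the global wait.
  wait {
    min = "2s"
    max = "10s"
  }
}
```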

pierresouchay commented 6 years ago

> That does seem like a huge amount of churn. Have you looked into consul-template's quiescence timers at all? They are designed to give you a hold-off period to rate limit updates.

Maybe we missed something, I'll check this tomorrow.

I made a basic PR with support for delaying calls; I'll keep you updated with the actual results on our cluster.

Regards

pierresouchay commented 6 years ago

Here is a small benchmark where I run consul-template against the local agent and measure the bandwidth between the Consul servers and the local Consul agent using iftop.

With the existing consul-template v0.19.4 (68b1da2)

consul { 
  # We keep this value for the benchmark as well
  wait {
    min = "5s"
    max = "10s"
  }
}

=> Download rate is around 500 Mbit/s

With our patched version of consul-template

Here is the proposed change: https://github.com/hashicorp/consul-template/pull/1066

min_delay_between_updates = 100ms

consul {
  rate_limit {
    random_backoff = "33ms"
    min_delay_between_updates = "100ms"
  }
}

=> Around 400 Mbit/s

min_delay_between_updates = 1s

consul {
  rate_limit {
    random_backoff = "33ms"
    min_delay_between_updates = "1s"
  }
}

=> Around 330 Mbit/s

min_delay_between_updates = 5s

consul {
  rate_limit {
    random_backoff = "33ms"
    min_delay_between_updates = "5s"
  }
}

=> Around 85 Mbit/s

Note that it does not change the time to wait for the first rendering; it only slows down subsequent updates.

While the patch might be adapted or tuned, I think it would be nice to have reasonable defaults here. Making min_delay_between_updates = "1s" the default sounds like a good idea to me (in my patch, it is currently 100ms).

Note that it is always possible to disable the feature with enabled = false.

vaLski commented 6 years ago

@pierresouchay from the results of your benchmark I can see that your patch outperforms the native quiescence timers with the same setting (5s) by a factor of 5+, which is an astonishing gain.

I am not sure where the difference is coming from. As far as I can understand, you basically re-implemented the native quiescence timers (wait { min = ..., max = ... }). Can you please explain?

I am experiencing an issue very similar to the one you described.

So the pressure here seems to be purely network related?

I have not tried de-duplication yet, but as far as I can understand, even if I do, it will only save CPU cycles on the Consul server nodes. The template will be rendered and saved once in the KV store (requiring far less CPU), but 7000k agents will still pull the data from the resulting KV path, generating the same network traffic. Or do the savings come from the fact that de-duplication uses compression, so agents pull fewer bytes from the KV later?

So I am wondering what might be the best possible way to scale/optimize this and push the network overhead away from the Consul servers.

Any assistance will be highly appreciated.

pierresouchay commented 6 years ago

@vaLski Hello,

> So the pressure here seems to be purely network related?

Yes. Actually, if you use a more recent Consul version (1.0.7+), this patch https://github.com/hashicorp/consul/pull/3899 really reduces the impact of many watches on heavily loaded clusters (see https://github.com/hashicorp/consul/issues/3890#issuecomment-366436108 for a more detailed explanation). The wait we are doing in the current patch gets a huge performance boost thanks to https://github.com/hashicorp/consul/pull/3899: previously, if you waited before performing the request, you would get all the data serialised immediately, whereas now the data is only sent back once it has really been modified in the cluster.

This PR against consul-template is a simple workaround to avoid issuing too many consecutive requests to the servers. In our clusters, for instance, just watching /v1/catalog/services generates lots of calls, because we see around 3 updates/sec globally.

So basically, if you have Consul 1.0.7+ on a heavily loaded cluster and configure min_delay_between_updates = 5s, this patch makes each watch wait at least 5 seconds after a successful query before issuing the next blocking request.

All of this makes it far more efficient on the network side, but also preserves a lot of CPU on the Consul servers (since almost no serialisation is done anymore, only when needed).

Regarding deduplication

I consider it a pure hack, implemented for the wrong reasons most of the time: you have to download the content anyway. Yes, you might see some gains (especially for services with many, many instances), but even in our infrastructure with more than 5k nodes per DC, it is not such a big deal as long as clients protect the servers properly (by delaying watches when appropriate).

So, basically, I am not convinced by the feature.

Since we ran into very hard issues, we also developed another tool, https://github.com/criteo/consul-templaterb, which offers most of the features of consul-template but with nicer ERB templating (and network debugging features).

On our side, it allows us to build a full GUI (or load-balancer configurations, for instance) covering all our services while using "only" 2 Mb/s and still getting immediate notifications; you can run it locally very easily.

I am looking for people running large infrastructures to share knowledge with (we created a Slack channel to discuss these kinds of issues with people from several companies). If you are interested in joining us, send me an email at p.souchay AT criteo.com and I'll be glad to send you an invitation.

Kind regards

vaLski commented 6 years ago

@pierresouchay Thank you very much for the detailed explanation. You definitely did a great job with https://github.com/hashicorp/consul/pull/3899, as a 5x improvement is quite noticeable, especially in setups like yours (5k nodes/DC). I am sure we will also benefit from that optimization in future "templatizations" of our infrastructure when dealing with the service catalog.

Alas, it does not help my specific case, as 3899 optimizes the services catalog, while I am experiencing saturation pulling and rendering data from a large KV store prefix.

Sorry if I badly hijacked your initial report; I am not sure whether my case fits exactly into it or is the other side of the same coin.

P.S. I just found hashicorp/consul/issues/3687, which was merged in 1.0.7 and might also help in my case once consul-template supports compression as well.