hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul DNS Performance testing #1535

Open ghost opened 8 years ago

ghost commented 8 years ago

We did a performance analysis of Consul DNS by measuring query latency. We ran a Consul server on localhost and registered 100 services, each with 100 instances with random IPs, using a Python script: https://gist.github.com/basj23/a608bf7fab506030f273

When running 100 DNS queries in parallel, latency increases linearly with the number of queries. We used Go code to perform CNAME lookups and analyze the latency: https://gist.github.com/basj23/a608bf7fab506030f273

Results can be seen in the same gist.

We are not sure whether this latency is caused by the parallel queries to Consul DNS or by something else.

Source code and results: https://gist.github.com/basj23/a608bf7fab506030f273

sodabrew commented 4 years ago

Hello! I've just done very similar performance testing with Consul 1.6.2 and using NS1's flamethrower tool. I've found a few things:

Queueing theory suggests that LIFO is a better processing model here: under high load, the server will begin dropping requests, but the requests it does respond to will be timely. What I'm seeing in production incidents, and what is consistent with this testing, is that once Consul gets backed up on DNS requests it stays backed up, every response falls behind typical client timeout thresholds, and Consul does not recover until seconds or minutes after the query load is removed.

See this article for discussion of FIFO vs. Adaptive LIFO / CoDel: https://queue.acm.org/detail.cfm?id=2839461

And this article from AirBnB about how they've adopted the same in their SOA: https://medium.com/airbnb-engineering/building-services-at-airbnb-part-3-ac6d4972fc2d

My naïve suggestion here would be to add a timestamp to DNS requests as they arrive. When preparing the response, if the timestamp is beyond a threshold, stop and move on. This still entails reading through the FIFO queue, though, so depending on the implementation it might not let Consul move through the queue quickly enough to shed the oldest requests faster than the inbound request rate. The next step would be switching from FIFO to LIFO once a response-time threshold is reached, which is obviously more work.

sodabrew commented 4 years ago

@banks you wrote and @mkeeler you reviewed this one: https://github.com/hashicorp/consul/pull/4511 - it looks like a LIFO interface for Serf, for exactly the same reason DNS would need to prioritize recent packets rather than FIFO!

mkeeler commented 4 years ago

Yes. Unfortunately, things aren't so simple for DNS and, by extension, the RPC requests done behind the scenes.

Both of these APIs spawn a new goroutine for every request, and the Go scheduler will not prioritize any one goroutine over another. We have no request tracking in place.

I have had it on my mind for a while that we should move away from spawning goroutines per request to more of a worker/futures model, since the large majority of the time spent servicing requests is probably spent waiting on RPCs to finish. On the servers themselves, processing DNS requests also has some priority inversion issues: as the number of outstanding DNS requests increases, the goroutines processing the RPCs have less time to do the real work.

All in all, LIFO queueing certainly could be great, but at this point it's probably down the list of optimizations needed. We have many other architectural changes that need doing; when we do them, however, we should certainly take this into account.
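For the record, the worker/futures model mentioned above could look something like the sketch below: a fixed pool of workers pulls requests from a bounded channel instead of spawning a goroutine per request, which caps concurrency and gives a natural place to shed load. All names here are illustrative assumptions, not a proposed Consul API.

```go
package main

import (
	"fmt"
	"sync"
)

// job carries a request plus a "future": the caller receives the
// answer on the result channel.
type job struct {
	name   string
	result chan string
}

type pool struct {
	jobs chan job
	wg   sync.WaitGroup
}

// newPool starts a fixed number of workers draining a bounded queue.
func newPool(workers, queueDepth int, handle func(string) string) *pool {
	p := &pool{jobs: make(chan job, queueDepth)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for j := range p.jobs {
				// Stand-in for the RPC round trip and answer building.
				j.result <- handle(j.name)
			}
		}()
	}
	return p
}

// submit enqueues a request; it returns false if the queue is full,
// which is where load shedding would happen.
func (p *pool) submit(name string) (chan string, bool) {
	res := make(chan string, 1)
	select {
	case p.jobs <- job{name: name, result: res}:
		return res, true
	default:
		return nil, false // queue full: shed the request
	}
}

func (p *pool) close() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	p := newPool(4, 16, func(name string) string {
		return "answer for " + name
	})
	if res, ok := p.submit("web.service.consul"); ok {
		fmt.Println(<-res)
	}
	p.close()
}
```

The bounded channel is what fixes the priority inversion described above: no matter how many DNS requests arrive, only `workers` of them compete with the RPC goroutines for scheduler time, and everything past `queueDepth` is shed immediately instead of piling up.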