mesosphere / mesos-dns

DNS-based service discovery for Mesos.
https://mesosphere.github.com/mesos-dns
Apache License 2.0

Adaptive Polling #336

Open sargun opened 8 years ago

sargun commented 8 years ago

@spacejam suggested that when Mesos starts up, Mesos-DNS should poll faster, so it builds up cluster state quickly while the entire cluster is coming up.
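For illustration, a minimal sketch of what that could look like in Go, the language Mesos-DNS is written in. The names (`pollOnce`, `startupInterval`, `steadyInterval`, `warmupPolls`) are hypothetical, not existing Mesos-DNS identifiers: poll on a short interval for a while, then settle into the normal refresh cadence.

```go
package main

import (
	"fmt"
	"time"
)

// pollOnce stands in for the existing state.json fetch; hypothetical.
func pollOnce() {
	fmt.Println("polling state.json at", time.Now().Format(time.RFC3339))
}

func main() {
	const (
		startupInterval = 2 * time.Second  // fast polling while the cluster comes up
		steadyInterval  = 30 * time.Second // the normal refresh interval
		warmupPolls     = 10               // how many fast polls before backing off
	)

	interval := startupInterval
	for i := 0; ; i++ {
		pollOnce()
		if i >= warmupPolls {
			interval = steadyInterval // settle into the steady-state cadence
		}
		time.Sleep(interval)
	}
}
```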

spacejam commented 8 years ago

A few different approaches:

cmaloney commented 8 years ago

Polling faster during startup is likely to make startup considerably slower for tasks that don't depend on looking up each other's nodes. state.json is fundamentally expensive to generate and transfer. There are some perf improvements heading into it, but it is still a full state dump, which is many MB of data for a busy cluster.

Polling state.json creates a giant time gap, because the Mesos master's libprocess process can only do that one thing at a time. If there are thousands of events coming in and out at the same time from agents and frameworks, all of them have to wait in the queue for state.json to finish being generated and sent out. That introduces jitter into getting things done. So if someone runs a dispatch / lambda type system, its critical metrics will be very negatively affected by polling more frequently (and this is a case where you don't really know whether the cluster is coming up or already running, because one second there is nothing running and the next there is potentially a lot).

There is also the question of how you know whether Mesos is just coming up, or whether it has been running for a while and you just happened to connect to it. How do you tell that there wasn't just an outage of the Mesos masters, and the cluster is now going to be really busy for a while while the Mesos masters ping the slaves and frameworks and figure out the state of the universe?

Ex: I have a Marathon app which is a frontend web service, and I want to start 1000 1-node instances across my fleet of servers. These instances are all independent, and would start very quickly if no one were pulling state.json. If someone is polling the Mesos master a lot, it will take considerably longer.

One of the worst things you can do to a cluster that is currently having issues is grab state.json: if a cluster is close to falling over, more state.json requests will exacerbate the issues, possibly pushing it to the point of master failure.

I really think that if we want to improve this sort of thing, we need push from the Mesos masters. The state deltas are tiny and quick to generate and send relative to the whole Mesos master state. The full state.json can then be used less frequently, just to verify that things don't get out of sync (and if they regularly do, detect that so engineers can investigate and fix it).
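A rough sketch of the push-plus-reconcile pattern being described here, with a hypothetical `Delta` event channel standing in for whatever the master would eventually stream (no such push API existed in Mesos-DNS at the time): apply small deltas as they arrive, and fetch the full state only occasionally to detect drift.

```go
package main

import (
	"fmt"
	"time"
)

// Delta is a hypothetical incremental update pushed by the master.
type Delta struct {
	TaskID string
	State  string // e.g. "TASK_RUNNING", "TASK_FINISHED"
}

// fetchFullState stands in for a full state.json download; hypothetical.
func fetchFullState() map[string]string {
	return map[string]string{} // taskID -> state
}

func main() {
	deltas := make(chan Delta) // would be fed by a push subscription
	local := map[string]string{}

	reconcile := time.NewTicker(5 * time.Minute) // infrequent full-state check
	defer reconcile.Stop()

	for {
		select {
		case d := <-deltas:
			// Cheap incremental update instead of re-reading the whole state.
			local[d.TaskID] = d.State
		case <-reconcile.C:
			// Occasionally verify against the full dump and report any drift.
			full := fetchFullState()
			for id, st := range full {
				if local[id] != st {
					fmt.Printf("drift on %s: have %q, master says %q\n", id, local[id], st)
				}
			}
			local = full
		}
	}
}
```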

jdef commented 8 years ago

Isn't there a design doc WIP somewhere that talks about pushing events from master? Not sure what the timeline is for getting those changes into a Mesos release.


spacejam commented 8 years ago

@cmaloney Not hitting state.json is out of scope for this. Given that we need to keep hitting it for now, what do you think about the third option I listed, where an operator explicitly states what % of the master's time should go into this? I like it because it plays the nicest in terms of capacity planning. Hits to master message-processing latency above the median percentile are unavoidable for now, but this gives explicit control over the impact on the master, which I agree is a key concern.
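A minimal sketch of that % option, under stated assumptions (the `budgetFraction`, `minWait`, and `maxWait` knobs are hypothetical): measure how long each state.json request took, then sleep long enough that polling consumes at most that fraction of the master's time, bounded at both ends.

```go
package main

import (
	"fmt"
	"time"
)

// fetchState stands in for the state.json poll; hypothetical.
func fetchState() { time.Sleep(500 * time.Millisecond) } // pretend the request took 500ms

func main() {
	const (
		budgetFraction = 0.05             // spend at most 5% of master time on our polls
		minWait        = 5 * time.Second  // floor so we never hammer the master
		maxWait        = 60 * time.Second // ceiling so records never go too stale
	)

	for {
		start := time.Now()
		fetchState()
		cost := time.Since(start)

		// If the request consumed `cost` of master time, waiting
		// cost*(1/fraction - 1) keeps our share at roughly budgetFraction.
		wait := time.Duration(float64(cost) * (1/budgetFraction - 1))
		if wait < minWait {
			wait = minWait
		}
		if wait > maxWait {
			wait = maxWait
		}
		fmt.Printf("poll took %v, next poll in %v\n", cost, wait)
		time.Sleep(wait)
	}
}
```

Because the wait is proportional to how long the master took to serve the request, the loop automatically backs off further when the master is more loaded, which is the property being argued for here.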

cmaloney commented 8 years ago

Percentage of Mesos master time still seems dangerous to me. I agree that not hitting state.json may be out of scope for what you were thinking, but it should entirely be doable within the timeline. You could hook some really rudimentary "push all TaskStatus updates" stuff into Mesos within a week's timeframe, and it could land in DCOS at least as a Mesos module to solve the problem.

Estimating "percentage of mesos master time" is dangerous to me because most the time the Mesos Master isn't overly active. Sometimes it becomes incredibly active though (Someone upgraded an app with 10,000 instances). In that case, mesos needs to process and ack 10,000 task status updates from the mesos agent, then send and recieve over 10,000 task status updates to the framework which had been running those tasks. If acks are not recieved within a relatively short timeout, then mesos will re-send the messages assuming they were lost to the network. If you query state.json a lot while this is happening, you can cause the Mesos Master to not process the acks it recieves fast enough (TaskStatus which are sent before the state.json request that return / enter the queue after the state.json request time out).

There are also n different instances of Mesos-DNS running and polling state.json, completely unaware of each other hitting the Mesos master. In our current configuration there are three, so each instance polling once every 30 seconds means on average a request every 10 seconds. If a cluster gets large, that will already cause issues. Ramping the request rate up in Mesos-DNS is likely to cause more issues more rapidly.

spacejam commented 8 years ago

The % approach automatically backs off for longer when the master is more loaded. It's still up to the capacity planner to set that % correctly, which is not possible at all with the current setup. Do you see any way in which this is not better than the current approach? People want stuff to launch quickly on a new cluster, and this is a big thorn in our current customer experience that I think we can address with a relatively tiny code change.

cmaloney commented 8 years ago

If you look at it relative to the classic "time to install Cassandra" etc., we're still doing really, really well.

If we make startup work differently, we make the use case of first-time cluster / service startup a little bit better, which users go through once. And launch-time-wise, we are still way ahead of manual orchestration + deployment in a classical system.

From a deployment time perspective, installing the base DCOS still takes longer than installing a framework. Installing a framework takes a couple of minutes, and yes, the UX is bad (although a lot of that we could fix at other layers, and there are some more fundamental constraints, like the time to download files to every host).

Practically though, I think we need to focus a lot more on overall running rather than "how fast does the demo start". Things like: how do I use Mesos-DNS to address my backend and update the backend app without losing access to my data? How do we make the behavior always completely predictable? Things like a dynamic polling interval make the currently predictable / fully understandable behavior less so. "It will show up in 30 seconds" as a general rule is easy to understand, measure, monitor, and guarantee against going wrong. With things like dynamic backoff in Marathon, we have had cases where there was a cluster problem, the backoff grew increasingly long, and there was no button for an operator to say "reset the backoff to zero and try immediately, because I made an administrative change / fix".

spacejam commented 8 years ago

I agree that the worst-case behavior should be well-understood, which is achievable with a configurable truncated backoff (at both ends of the desirable range). I see this as a tiny, easily-testable change that improves initial customer experience, which is important. I want this because I don't like waiting for clusters to come up, and because I want to write the least code possible to address this.
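A sketch of such a truncated backoff, bounded at both ends and with a `Reset` hook addressing the "operator made a fix, try immediately" concern raised above (all names are hypothetical, not Mesos-DNS code):

```go
package main

import (
	"fmt"
	"time"
)

// backoff is a truncated exponential backoff bounded at both ends.
type backoff struct {
	min, max, cur time.Duration
}

func newBackoff(min, max time.Duration) *backoff {
	return &backoff{min: min, max: max, cur: min}
}

// Next returns the current wait and doubles it, capped at max.
func (b *backoff) Next() time.Duration {
	d := b.cur
	b.cur *= 2
	if b.cur > b.max {
		b.cur = b.max
	}
	return d
}

// Reset drops back to the minimum, e.g. after an operator fix.
func (b *backoff) Reset() { b.cur = b.min }

func main() {
	b := newBackoff(2*time.Second, 30*time.Second)
	for i := 0; i < 6; i++ {
		fmt.Println("next poll in", b.Next()) // 2s, 4s, 8s, 16s, 30s, 30s
	}
	b.Reset()
	fmt.Println("after reset:", b.Next()) // back to 2s
}
```

The worst case is always `max`, so the "it will show up within N seconds" guarantee stays easy to state and monitor.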

cmaloney commented 8 years ago

It still leaves wide open the even worse experience of "after a cluster is up, lots of things can cause service discovery to take too long". It's really optimizing for the startup case, and slow startup is something customers will forgive us for. Taking out production isn't. Both are better solved by getting push throughout the stack.

Spinning up the cluster 100 times isn't what our customers will mostly be doing (at least not in v1 deployments). Running it, and expecting things like full cluster updates to be reasonable, is going to be far more common.

sargun commented 8 years ago

@cmaloney Apart from this ask, the Mesos master should throttle requests, or at least indicate if a particular client is causing a degradation in QoS. I know that this kind of feature has been added to the Scheduler API (http://mesos.apache.org/documentation/latest/framework-rate-limiting/).

The other side of this: I think it's valuable to add backoff even independently of adaptive polling, because if enough Mesos-DNS clients are run in an odd way, their load can line up across instances, pushing up the time to generate state.json. Given the way refreshInterval works right now, it'll effectively DDoS the Mesos master.
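One small, hedged sketch of a mitigation for that lining-up effect: add jitter to the refresh loop so independent Mesos-DNS instances drift apart instead of hitting the master together (the 30-second value is just the interval being discussed, not a real config read).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const refresh = 30 * time.Second // the configured refresh interval

	for i := 0; i < 5; i++ {
		// Spread each poll by up to +/-10% of the interval so multiple
		// Mesos-DNS instances don't synchronize their state.json requests.
		jitter := time.Duration((rand.Float64() - 0.5) * 0.2 * float64(refresh))
		wait := refresh + jitter
		fmt.Println("next poll in", wait)
		// time.Sleep(wait) // then fetch state.json
	}
}
```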

cmaloney commented 8 years ago

@sargun That would be a perfectly reasonable thing for Mesos to do; you should file a Mesos JIRA asking for it.

As far as adding a backoff now: I think that is of mixed value. Right now people can depend on the 30-second service discovery lag to coordinate updates / upgrades. Yes, the current behavior could tip over a cluster / DDoS a master, but in practice it's fairly obvious when that is happening and how to fix it. With dynamic throttling, I'm still very worried people won't notice the change. (If we gathered statistics on the current backoff times so that admins could be alerted / notified, that would be a different story; currently we don't have any such mechanism.)
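For what it's worth, exposing the effective interval as a metric is cheap with Go's standard library; a hedged sketch (the metric name and port are made up, and this is not existing Mesos-DNS instrumentation):

```go
package main

import (
	"expvar"
	"log"
	"net/http"
	"time"
)

// currentPollInterval is published at /debug/vars so operators can
// alert when the effective interval drifts away from the expected 30s.
var currentPollInterval = expvar.NewFloat("mesos_dns_poll_interval_seconds")

func main() {
	currentPollInterval.Set((30 * time.Second).Seconds())
	// Whatever logic adjusts the interval would call currentPollInterval.Set(...) too.
	log.Fatal(http.ListenAndServe(":8123", nil)) // expvar registers itself on the default mux
}
```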