Currently, the Snap plugin collects metrics by polling various Mesos APIs over HTTP. This is probably fine, but it could prove problematic: requests could stack up if they don't complete within the collection interval, and very active clusters have had performance issues in the past (see MESOS-2353). Things are much better now that MESOS-2353 is resolved, but I don't have data on how well this performs at very large scale or on very active clusters. Considering the amount of data the HTTP APIs will need to generate for us to satisfy #25 and #26, we might run into issues with this approach.
I've been kicking around the idea of writing a Mesos plugin (a Mesos module) that collects just the metrics we care about from within Mesos itself on a set interval and then pushes them to the Snap plugin, which could listen on a local UDP or Unix socket.
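To make the push model a little more concrete, here is a minimal Python sketch of the two ends of a local UDP channel. Everything in it is an assumption for illustration: the JSON wire format, the function names (`open_listener`, `receive_batch`, `push_batch`), and the loopback address are hypothetical, and a real Mesos module would be written in C++ against the Mesos module API rather than in Python.

```python
import json
import socket

# Largest payload that fits in a single UDP/IPv4 datagram.
MAX_DATAGRAM = 65507


def open_listener(host="127.0.0.1", port=0):
    """Receiver side: bind a UDP socket on loopback (port 0 lets the OS pick)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_batch(sock, timeout=1.0):
    """Wait for one datagram and decode it.

    Returns the decoded object, or None on timeout or an undecodable payload.
    """
    sock.settimeout(timeout)
    try:
        payload, _addr = sock.recvfrom(MAX_DATAGRAM)
    except socket.timeout:
        return None
    try:
        return json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None


def push_batch(address, metrics):
    """Sender side (hypothetical): serialize one batch and send one datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(json.dumps(metrics).encode("utf-8"), address)
```

One consequence of this design is that a batch has to fit in a single datagram, so large batches would need to be split or sent over a Unix stream socket instead.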
Before going down this path, it'd be great to have some performance numbers for the current polling mechanism, so we can determine how much of an improvement we might see here. Some immediate benefits I can see include:
detecting that a collection operation is still in progress internally and skipping the next iteration if the current one exceeds the collection interval
queueing metrics if the Snap daemon is unavailable for some reason (being reloaded, upgraded, etc.)
assuming that we're able to collect metrics faster by using a Mesos plugin than by polling the HTTP API, we'd also get additional insight into short-lived containers (right now considering >= 1 sec, assuming we can serialize large amounts of data fast enough).
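The first two benefits can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Collector` class, its `collect` and `send` callables, and the `max_queued` limit are hypothetical names, the queue is bounded so memory stays capped while the receiver is down (the oldest batches are dropped first), and an unavailable receiver is assumed to surface as an `OSError` from `send`.

```python
import threading
from collections import deque


class Collector:
    """Skips overlapping collections and queues batches while the receiver is down."""

    def __init__(self, collect, send, max_queued=1000):
        self._collect = collect          # hypothetical: gathers one batch of metrics
        self._send = send                # hypothetical: pushes one batch, raises OSError on failure
        self._busy = threading.Lock()    # held while a collection is in progress
        self._queue = deque(maxlen=max_queued)  # oldest batch is dropped when full
        self.skipped = 0

    def tick(self):
        """Call once per interval. Returns False if this iteration was skipped."""
        if not self._busy.acquire(blocking=False):
            # The previous collection is still running: skip this iteration.
            self.skipped += 1
            return False
        try:
            self._queue.append(self._collect())
            self._flush()
            return True
        finally:
            self._busy.release()

    def _flush(self):
        """Send queued batches in order; stop at the first failure and keep the rest."""
        while self._queue:
            try:
                self._send(self._queue[0])
            except OSError:
                return
            self._queue.popleft()

    def pending(self):
        return len(self._queue)
```

A timer thread would call `tick()` every interval; `skipped` and `pending()` could themselves be exported as metrics to show how often the collector falls behind.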
More information about Mesos modules: