dishmael commented 7 years ago

Summary

Often, we need to gather metrics from a remote system using a centralized collector or, ideally, a cluster (tribe) of snap collectors. The overarching goal is to define a single task that collects one or more metrics from a remote node, submit that task to the tribe, and have the tribe assign it to a worker for collection.

Proposal

At a configurable rate, the collectors would vote on which collector becomes the master and which collectors gather the metric(s) defined in a task shared amongst members of a tribe. This can be achieved using, for example, Raft - https://raft.github.io. Busy collectors would naturally be slower to respond, so faster, less-utilized collectors would be selected to gather those metrics. Tribe HA (this RFC) is configurable as a grouping option, allowing users to define which cluster members operate in an HA model, since not all snap telemetry tribes need to be HA.
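To make the "faster responders win" idea concrete, here is a minimal sketch. It is not Raft and not anything in Snap today: the `Probe` type, the `select_roles` function, and the rule that the fastest responder becomes master are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Result of one hypothetical health probe sent to a collector."""
    node: str
    response_ms: float  # round-trip time; busy collectors answer more slowly


def select_roles(probes, worker_count):
    """Rank collectors by response time and split them into master and workers.

    Stand-in for a real consensus round (e.g. Raft): it only shows how
    response time could translate into role assignment.
    """
    if not probes:
        raise ValueError("no collectors responded")
    # Sort by latency; break ties by node name so the result is deterministic.
    ranked = sorted(probes, key=lambda p: (p.response_ms, p.node))
    master = ranked[0].node
    workers = [p.node for p in ranked[1:1 + worker_count]]
    return master, workers


if __name__ == "__main__":
    probes = [Probe("a", 30.0), Probe("b", 5.0), Probe("c", 12.0), Probe("d", 80.0)]
    print(select_roles(probes, worker_count=2))
```

A real implementation would also need terms, heartbeats, and quorum rules, which is what a consensus protocol such as Raft provides.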

Motivation/Use Cases

The Raft link above has a decent description of how cluster consensus might work in the Tribe architecture. This RFC targets the following motivations and use cases:

Tribe configurable; not all members need HA
Tribe membership follows existing paradigm; all members obtain plugins and task definitions
Consensus voting amongst tribe cluster members to determine Master
Upon election, Master assigns tasking to workers (publish/subscribe model?)
Re-election occurs at predefined periods and may be based on snap telemetry daemon utilization
Task tracking needs to be considered to ensure all tasks are completed
Task execution follows existing paradigm; collection --> processing --> publishing
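The assignment and task-tracking points above could look roughly like the sketch below. It assumes a hypothetical `TaskTracker` held by the Master; the class, its method names, and the least-loaded assignment rule are illustrative only and are not part of Snap.

```python
class TaskTracker:
    """Tracks which worker owns each task so every task has exactly one owner."""

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker is required")
        self.workers = list(workers)
        self.owner = {}  # task id -> worker name

    def _least_loaded(self):
        # Count tasks per worker, then pick the worker with the fewest tasks.
        load = {w: 0 for w in self.workers}
        for w in self.owner.values():
            load[w] += 1
        # Ties are broken by name so the choice is deterministic.
        return min(self.workers, key=lambda w: (load[w], w))

    def assign(self, task_id):
        """Assign a task once; a repeated submission returns the existing owner."""
        if task_id not in self.owner:
            self.owner[task_id] = self._least_loaded()
        return self.owner[task_id]

    def worker_failed(self, worker):
        """Drop a failed worker and hand its tasks to the remaining workers."""
        if len(self.workers) <= 1:
            raise RuntimeError("no remaining workers to take over tasks")
        self.workers.remove(worker)
        orphaned = sorted(t for t, w in self.owner.items() if w == worker)
        # Remove all orphaned entries first so load counts only live workers.
        for t in orphaned:
            del self.owner[t]
        return {t: self.assign(t) for t in orphaned}
```

Keeping a single owner per task is what avoids duplicate polling; reassigning on failure is what provides the HA behaviour.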

Benefits

Utilizing a cluster that has a Master/Workers architecture provides high availability without duplicate polling. A task can be defined once, submitted to the tribe, executed only once per collection interval, and guaranteed to be collected by one of the workers.

Drawbacks

This may add overhead to tribes, and it will certainly increase the amount of cross-chatter between snap telemetry instances.

Definitions

The following definitions are used in this RFC:

Master: A Node in a Tribe that has been elected to assign tasks to Workers in the tribe cluster. There can be only a single instance of a Master in a Tribe cluster.
Node: An instance of the snapteld daemon (may run one or more on a physical/virtual host).
Tribe: A collection of Nodes
Worker: A Node in a Tribe cluster that is not the Master and is designated to execute tasks. A Tribe cluster can have one or more Workers.

Issues Addressed

The following issues would be satisfied by implementing this RFC:

#1558 Clusters and Workers
#773 Snap High Availability

candysmurf commented 7 years ago

@dishmael, thanks for your RFC. What you proposed here is more like RAFT or Zookeeper which would be great if there is a need to coordinate across clusters/tribes. It's definitely a good direction to go.

I think #773 is low-hanging fruit. Will #773 help your use cases?

dishmael commented 7 years ago

@candysmurf this RFC would satisfy the need of #773 (HA) and #1558 (No Duplicate Polling).

jtlisi commented 7 years ago

I feel I have something to add to this. I think the idea of distributing a task across a tribe is a great one in principle. I would really like to see ideas on how this would be implemented, since I have some particular use cases in mind.

The primary use case I had in mind was service discovery and the collection of metrics from container-bound applications. For example, consider a pod in Kubernetes running a group of containers that all host a */metrics endpoint with application metrics. I would want to use this feature to dynamically schedule the collection of metrics from these endpoints.

In the above use case, sharing a task is useful for accomplishing this. However, this feature seems incomplete without some associated form of service discovery. Snap needs a way to schedule and un-schedule shared tasks based on contextual data obtained through some form of service discovery, similar to how Prometheus collects from a Kubernetes cluster.

This feature doesn't necessarily have to be integrated directly into snap. It could be done using an external scheduling daemon that exists outside of snap and interacts with it using the Snap REST API. Or it could be implemented directly as a new type of plugin designed for shared tasks that can pass configuration forward to a set of collectors.
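As a rough sketch of the external-daemon option, the core of such a scheduler is a reconcile step that compares what service discovery reports with what is currently scheduled. Everything here is hypothetical: `reconcile`, `apply_changes`, and the `create_task` / `remove_task` callbacks are placeholder names, and the callbacks merely stand in for whatever calls the daemon would make against Snap.

```python
def reconcile(discovered, scheduled):
    """Return (to_schedule, to_unschedule) as sorted lists of endpoints."""
    discovered = set(discovered)
    scheduled = set(scheduled)
    return sorted(discovered - scheduled), sorted(scheduled - discovered)


def apply_changes(discovered, scheduled, create_task, remove_task):
    """Run one reconcile pass and return the new set of scheduled endpoints.

    create_task and remove_task are caller-supplied callbacks (assumed here);
    in a real daemon they would talk to Snap.
    """
    to_schedule, to_unschedule = reconcile(discovered, scheduled)
    for endpoint in to_schedule:
        create_task(endpoint)
    for endpoint in to_unschedule:
        remove_task(endpoint)
    return (set(scheduled) | set(to_schedule)) - set(to_unschedule)


if __name__ == "__main__":
    created, removed = [], []
    current = apply_changes(
        discovered=["10.0.0.1:8080/metrics", "10.0.0.3:8080/metrics"],
        scheduled=["10.0.0.1:8080/metrics", "10.0.0.2:8080/metrics"],
        create_task=created.append,
        remove_task=removed.append,
    )
    print(sorted(current), created, removed)
```

Running this pass on every discovery update keeps the scheduled tasks in step with the containers that actually exist.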

Let me know what you guys think of this idea, it's something I feel would be really useful in a container based deployment.

candysmurf commented 7 years ago

I have to agree that @jtlisi has a good point that something may be achieved outside Snap. @dishmael, would you please add more of your thoughts on how this would work with container replicas?

jcooklin commented 7 years ago

This feature doesn't necessarily have to be integrated directly into snap. This could be done using an external scheduling daemon that exists outside of snap and interacts with it using the Snap Rest API.

I tend to agree. I would like to see tribe be something that integrates with snap. This would foster more options for the management layer, the obvious one being a layer backed by something like etcd instead of the gossip we use today.

andrzej-k commented 7 years ago

So it seems that before we can implement this RFC, we need to separate tribe from the main Snap repo. Is that right, @jcooklin?

@jtlisi If you'd like to monitor applications in Kubernetes, you could also think about creating a Snap Third Party Resource (TPR) that associates an application (and its metric endpoint) with Snap (a task manifest). You would then need a watcher on the Kubernetes API that tells you when a new application pod has started and checks whether the corresponding Snap TPR is running. With this information, all you need is some automation to load plugins and tasks. We will be implementing such a solution in the future. The work will be done under: https://github.com/intelsdi-x/snap-integration-kubernetes

candysmurf commented 7 years ago

@andrzej-k, do you have statistics on how many of our customers are using tribe?

jcooklin commented 7 years ago

So it seems that before we will be able to implement this RFC we need to separate tribe from main Snap repo, is that right @jcooklin?

Correct.

intelsdi-x / snap

RFC: Tribe Clusters and Worker Pattern #1584