jrasell / sherpa

Sherpa is a highly available, fast, and flexible horizontal job scaler for HashiCorp Nomad. It is capable of running in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.
Mozilla Public License 2.0

autoscaler: allow use of custom metric sources in autoscaler evaluation #102

Closed: jrasell closed this issue 4 years ago

jrasell commented 4 years ago

Is your feature request related to a problem? Please describe. In order to use custom metrics to scale, the metric store, such as Prometheus, must be configured to send web-hook requests to the Sherpa API when a metric is found to violate a policy. This can involve setting up and configuring additional infrastructure components and, depending on the operator's experience with those applications, can be somewhat time consuming.

Describe the solution you'd like. The autoscaler, and therefore the scaling policy document, should be updated to allow the use of metrics from external providers. These values can then be used to make scaling decisions alongside the standard memory and CPU metrics which are calculated from Nomad.
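For illustration only, an extended policy could look something like the Go sketch below; every field name here is hypothetical and just shows the rough shape of the idea, not a committed design.

package policy

// ExternalCheck describes a single query against an external metric provider
// and the bounds which should trigger a scaling action. (Illustrative only.)
type ExternalCheck struct {
	Provider          string  `json:"Provider"`          // e.g. "prometheus"
	Query             string  `json:"Query"`             // provider-specific query string
	ScaleOutThreshold float64 `json:"ScaleOutThreshold"` // scale out when the result exceeds this
	ScaleInThreshold  float64 `json:"ScaleInThreshold"`  // scale in when the result falls below this
}

// ScalingPolicy extends the existing CPU and memory based policy with an
// optional list of external checks. (Field names are hypothetical.)
type ScalingPolicy struct {
	Enabled                        bool            `json:"Enabled"`
	MinCount                       int             `json:"MinCount"`
	MaxCount                       int             `json:"MaxCount"`
	ScaleOutCPUPercentageThreshold *int            `json:"ScaleOutCPUPercentageThreshold,omitempty"`
	ScaleInCPUPercentageThreshold  *int            `json:"ScaleInCPUPercentageThreshold,omitempty"`
	ExternalChecks                 []ExternalCheck `json:"ExternalChecks,omitempty"`
}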

numiralofe commented 4 years ago

hi @jrasell

This is something that I have been thinking about for some time, and I was looking at the scaler code to check how doable it is to use prometheus as a possible provider to query for such metrics.

1st -> set the prometheus endpoint in the sherpa config parameters:

--prometheus-api = "http://prometheus/api/v1/query?query="

2nd -> at the policy level we could set, per job, the URL to fetch the specific metric and its corresponding threshold:

{
  "ScaleOutPrometheusThreshold": [ "sum(rate(jvm_threads_total_ticks{my_app='java_app'}[1m]))by(java_app)", "10" ],
  "ScaleInPrometheusThreshold": [ "sum(rate(jvm_threads_total_ticks{my_app='java_app'}[1m]))by(java_app)", "5" ]
}

I think that for those who already have / are using prometheus to collect data & metrics, this approach gives enough flexibility for the operator to scale based on whatever values are most appropriate, as long as they are "queryable" in prometheus.
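To make this a little more concrete, here is a rough Go sketch of what such a query could look like against the standard Prometheus HTTP API (GET /api/v1/query); the endpoint, query, and thresholds mirror the example above and are purely illustrative, not actual Sherpa code.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// promResponse captures only the parts of a Prometheus instant-query
// response that we need for a threshold comparison.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [unix timestamp, value as string]
		} `json:"result"`
	} `json:"data"`
}

// queryPrometheus runs an instant query against the configured endpoint and
// returns the first sample value.
func queryPrometheus(endpoint, query string) (float64, error) {
	resp, err := http.Get(endpoint + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		return 0, err
	}
	if pr.Status != "success" || len(pr.Data.Result) == 0 {
		return 0, fmt.Errorf("no result for query %q", query)
	}
	s, ok := pr.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected value type in response")
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	endpoint := "http://prometheus" // the --prometheus-api style config value
	query := `sum(rate(jvm_threads_total_ticks{my_app='java_app'}[1m]))by(java_app)`

	value, err := queryPrometheus(endpoint, query)
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}

	// Compare against the per-job thresholds from the policy document.
	switch {
	case value > 10:
		fmt.Println("scale out")
	case value < 5:
		fmt.Println("scale in")
	default:
		fmt.Println("no action")
	}
}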

final note: I'm not sure if binding to a specific technology like prometheus is suitable, or if that is the direction sherpa (as a project) wants to take; perhaps many are also using other tools (like elastic, for instance) to store metrics data. Anyway, as long as a tool provides a "queryable" API (like prometheus does), a similar approach could also be valid for such scenarios.

jrasell commented 4 years ago

@numiralofe thanks for the comment; as a solo maintainer any and all feedback is useful, so I appreciate you taking the time to write. You are exactly right about the config params addition, although I am likely thinking of just taking the address rather than the full query path, so that any changes on the prometheus side can be tested before rolling them out. In the event of a v2 API, that query path could also become a config param.

You are correct in thinking that binding to a specific tool is probably not the right direction. I aim to make the metric provider easily extensible via interfaces, and the policy document flexible so users can add as many external checks as they wish, or none at all. This would allow any queryable metrics API to be added to Sherpa.
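As a very rough sketch of what I mean by extensible via interfaces (names here are hypothetical and not a committed design):

package autoscale

// MetricProvider is anything that can resolve a query string into a single
// numeric value the autoscaler can compare against a threshold.
type MetricProvider interface {
	// QueryValue runs the provider-specific query and returns the latest value.
	QueryValue(query string) (float64, error)
}

// providers maps a name used in the policy document (e.g. "prometheus") to an
// implementation, so new backends can be added without touching the autoscaler.
var providers = map[string]MetricProvider{}

// RegisterProvider makes a provider available to scaling policies by name.
func RegisterProvider(name string, p MetricProvider) {
	providers[name] = p
}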

Something I would be curious to get an additional opinion on is making the ScaleIn and ScaleOut threshold policy parameters optional. My thinking behind this is that a user might not want to use the internal autoscaler to analyse CPU and memory metrics, but instead rely on calling an external source. With these parameters optional, the autoscaler handler would just skip performing those checks.
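A minimal sketch of what that could look like, assuming the thresholds become nullable fields (names are illustrative only):

package autoscale

// checkCPU sketches the "optional threshold" idea: a nil threshold means the
// operator did not configure the internal CPU check, so the autoscaler skips
// it and relies on any configured external checks instead.
func checkCPU(threshold *int, usagePercent int) (breach bool, checked bool) {
	if threshold == nil {
		return false, false // not configured: skip the internal check
	}
	return usagePercent > *threshold, true
}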

I plan on hitting this full on tonight, after I finish my regular day of writing code.

numiralofe commented 4 years ago

Hi @jrasell

"My thinking behind this is that a user might not want to use the internal autoscaler to analyse CPU and memory metrics, but instead rely on calling an external source."

You are absolutely correct. In my case, java apps usually do not go over the -Xmx memory option defined at boot time and rarely have "huge" spikes because of the way they are designed, but metrics like jvm threads / number of requests / message queue numbers / etc. are important in the decision of whether to scale them or not.

"I aim to make the metric provider easily extensible via interfaces"

I'm not sure I follow you completely on this :) Just as an example (which I think is probably a standard case for other users too): we store all our metrics in prometheus and the ELK stack, and in some cases I need info from both systems to be able to correctly evaluate whether the application needs to be scaled or not.

e.g. a growing number of java exceptions in the logs might indicate that the app is running out of resources and needs to be scaled, and this is a metric that can only be retrieved from the logging stack.