streamnative / function-mesh

The serverless framework purpose-built for event streaming applications.
https://functionmesh.io/
Apache License 2.0

Add Pulsar unique autoscaling metrics #457

Open tpiperatgod opened 2 years ago

tpiperatgod commented 2 years ago

The autoscaling of FunctionMesh's resources is currently controlled by HPA.

We can add some Pulsar unique metrics to the HPA to determine if the target workload needs to be scaled.

Here are two approaches:

  1. Introduce KEDA to FunctionMesh. KEDA already provides a Pulsar scaler and also supports custom resources (CRDs) as scale targets, ref: https://keda.sh/docs/2.8/concepts/scaling-deployments/#scaling-of-custom-resources
  2. Add an HPA extension adapter to FunctionMesh and develop a built-in scaler that aligns with the KEDA Pulsar scaler.

What do you think?

tpiperatgod commented 2 years ago

Also, with the new scaler, FunctionMesh can scale a function down to 0 replicas.

hpvd commented 2 years ago

+1 on this. I was also thinking about using KEDA when we talked about the relationship between image size and spin-up duration / faster dynamic scaling in the "advantages of distroless" topic: https://github.com/streamnative/function-mesh/issues/448

hpvd commented 2 years ago

regarding KEDA: this is a good introduction: https://medium.com/backstagewitharchitects/how-autoscaling-works-in-kubernetes-why-you-need-to-start-using-keda-b601b483d355 (the embedded video is also interesting)

hpvd commented 2 years ago

there is already a blogpost saying that KEDA may be a future direction (at the end of https://streamnative.cn/blog/engineering/2022-01-19-auto-scaling-pulsar-functions-in-kubernetes-using-custom-metrics-zh/)

hpvd commented 2 years ago

of course in some/many usecases the possibility to easily autoscale to zero would help a lot in the field of infrastructure costs...

tpiperatgod commented 2 years ago

Overview

Function Mesh's function instances can be scaled dynamically by HPA based on CPU and memory metrics. However, Function Mesh cannot yet scale to/from 0 replicas. This proposal aims to provide a solution that implements this feature.

Motivation

Provide the ability to scale the function instances of Function Mesh to/from 0 replicas.

Proposal

I propose introducing the KEDA project as the basis for scaling Function Mesh's function instances to/from 0 replicas. The advantage of this solution is that Function Mesh's event engine is Pulsar, and KEDA already has a Pulsar scaler that can use a topic's message backlog as the metric for function scaling.

Structure for scaling configurations:

type AdvanceScaleConfig struct {
    Driver   string            `json:"driver,omitempty"`   // Driver for the scaler; available: "keda"
    Topics   []string          `json:"topics,omitempty"`   // Topics used to trigger the scaler
    Strategy map[string]string `json:"strategy,omitempty"` // Trigger strategy (string key/value pairs)
}

Example:

spec:
  advanceScaleConfig:
    driver: keda
    topics:
      - persistent://public/default/my-topic-1
      - persistent://public/default/my-topic-2
    strategy:
      msgBacklogThreshold: "10"
      activationMsgBacklogThreshold: "2"
      pollingInterval: "30"

According to the definition of the KEDA Pulsar scaler, each trigger watches exactly one topic, so if the Function has multiple topics (spec.inputs), the Operator will generate one trigger per topic.

Example of KEDA ScaledObject resource for the above configuration:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: <function-name>-scaler
  namespace: <function-namespace>
spec:
  scaleTargetRef:
    name: <function-sts-name>
  pollingInterval: 30
  triggers:
  - type: pulsar
    metadata:
      adminURL: http://localhost:80 # Get from spec.pulsar.pulsarConfig
      topic: persistent://public/default/my-topic-1
      subscription: sub1 # Get from spec.SubscriptionName
      msgBacklogThreshold: '10'
      activationMsgBacklogThreshold: '2'
  - type: pulsar
    metadata:
      adminURL: http://localhost:80 # Get from spec.pulsar.pulsarConfig
      topic: persistent://public/default/my-topic-2
      subscription: sub1 # Get from spec.SubscriptionName
      msgBacklogThreshold: '10'
      activationMsgBacklogThreshold: '2'

Example for the auth section. If the following is configured in the Function:

spec:
  pulsar:
    tlsConfig: 
      enabled: true
      allowInsecure: true
      certSecretName: "ca-name"
      certSecretKey: "ca-key"

The corresponding KEDA resources would be:

apiVersion: v1
kind: Secret
metadata:
  name: <function-name>-keda-tls-secrets
  namespace: <function-namespace>
data:
  cert: "ca-name"
  key: "ca-key"
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: <function-name>-keda-trigger-auth-pulsar-credential
  namespace: <function-namespace>
spec:
  secretTargetRef:
  - parameter: cert
    name: <function-name>-keda-tls-secrets
    key: cert
  - parameter: key
    name: <function-name>-keda-tls-secrets
    key: key
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: <function-name>-scaler
  namespace: <function-namespace>
spec:
  scaleTargetRef:
    name: <function-sts-name>
  pollingInterval: 30
  triggers:
  - type: pulsar
    metadata:
      tls: "enable"
      adminURL: https://localhost:8443
      topic: persistent://public/default/my-topic
      subscription: sub1
      msgBacklogThreshold: '5'
    authenticationRef:
      name: <function-name>-keda-trigger-auth-pulsar-credential

tpiperatgod commented 2 years ago

state-machine-diagram is here

tpiperatgod commented 2 years ago

of course in some/many usecases the possibility to easily autoscale to zero would help a lot in the field of infrastructure costs...

Hi @hpvd, it seems you are interested in this work. May I ask which company you work for, and what kinds of use cases you are using Function Mesh for?

hpvd commented 2 years ago

@tpiperatgod thanks for your question. We are still incubating our new company ;-) It's in the field of mechanical engineering... We are looking into Pulsar for streaming but also for high-load, on-demand batch processing. Because of the latter, and the fact that we and our customers don't (always) work 24/7, scaling to zero is more than a nice-to-have... (yes, we could work with crons, but this is not flexible and the number of rules always keeps growing..) Besides this, we are interested in strong security for everything and of course the main features of Pulsar, like great performance, built-in geo-replication and functions, relatively low effort for constant maintenance ...

tpiperatgod commented 2 years ago

Oh, I see. So for now you're worried about two things.

And the community is working on these issues.

You are welcome to participate in building the community

hpvd commented 2 years ago

thanks for your warm words. Yes, there was a lot of great progress and there are many good things on the way... e.g.

and also

hpvd commented 2 years ago

these 2 points may be interesting for testing and release of this new functionality:

1) new in latest KEDA (v2.8): Activation and Scaling Thresholds

Previously in KEDA, when scaling from 0 to 1, KEDA would “activate” (scale to 1) a resource when any activity happened on that event source. For example, if using a queue, a single message on the queue would trigger activation and scale.

As of this release, we now allow you to set an activationThreshold for many scalers which is the metric that must be hit before scaling to 1.

This would allow you to delay scaling up to 1 until n number of messages were unprocessed. This pairs with other thresholds and target values for scaling from 1 to n instances, where the HPA will scale out to n instances based on the current event metric and the defined threshold values.

Details on thresholds and the new activation thresholds can be found in the KEDA concept docs

see https://keda.sh/blog/2022-08-10-keda-2.8.0-release/
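
To make the two kinds of thresholds concrete, here is a minimal Go sketch of how they might interact for a backlog metric. It is a simplified model and not KEDA's or the HPA's actual code: the function name `DesiredReplicas` and its parameters are hypothetical, activation is modelled as "backlog strictly greater than the activation threshold", the 1-to-n phase is approximated as `ceil(backlog / threshold)`, and cooldown periods and HPA stabilization are ignored.

```go
package main

import "fmt"

// DesiredReplicas is a simplified, hypothetical model of scaling on backlog.
//   - Activation phase (0 <-> 1): stay at 0 until the backlog exceeds
//     activationThreshold.
//   - Scaling phase (1 <-> n): approximate the HPA result as
//     ceil(backlog / scaleThreshold), capped at maxReplicas.
func DesiredReplicas(backlog, activationThreshold, scaleThreshold, maxReplicas int64) int64 {
	if scaleThreshold <= 0 || maxReplicas <= 0 {
		return 0 // invalid configuration in this sketch: do not scale
	}
	if backlog <= activationThreshold {
		return 0 // not active: the workload may stay at zero replicas
	}
	// Integer ceiling division; backlog > 0 whenever activationThreshold >= 0.
	n := (backlog + scaleThreshold - 1) / scaleThreshold
	if n < 1 {
		n = 1
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// Thresholds taken from the example configuration earlier in this issue:
	// activationMsgBacklogThreshold = 2, msgBacklogThreshold = 10.
	for _, backlog := range []int64{0, 2, 3, 25, 1000} {
		fmt.Printf("backlog=%d -> replicas=%d\n", backlog, DesiredReplicas(backlog, 2, 10, 5))
	}
}
```

With these values, a backlog of 2 keeps the function at zero, a backlog of 3 activates a single replica, and larger backlogs grow the replica count until the cap of 5 (an arbitrary maximum chosen for the example) is reached.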

2) next KEDA v2.9 is planned for Nov 3rd

(but not sure if this will happen) see https://github.com/kedacore/keda/blob/main/ROADMAP.md

hpvd commented 1 year ago

KEDA 2.9 was released: https://github.com/kedacore/keda/blob/main/CHANGELOG.md#v291