kedacore / keda

KEDA is a Kubernetes-based Event-Driven Autoscaling component. It provides event-driven scaling for any container running in Kubernetes.
https://keda.sh
Apache License 2.0

Provide an OpenTelemetry scaler #2353

Open tomkerkhove opened 2 years ago

tomkerkhove commented 2 years ago

Proposal

OpenTelemetry allows applications/vendors to push metrics to a collector or to integrate their own exporters in the app.

KEDA should provide an OpenTelemetry scaler that acts as an exporter target, so we can pull metrics and scale accordingly.

Scaler Source

OpenTelemetry Metrics

Scaling Mechanics

Scale based on returned metrics.

Authentication Source

TBD

Anything else?

OpenTelemetry Metrics are still in beta but are expected to go GA by the end of the year.

Go SDK: https://github.com/open-telemetry/opentelemetry-go

mknet3 commented 2 years ago

This is a really good improvement; I will have a look to see if I can help with this topic 🙂

tomkerkhove commented 2 years ago

Awesome, thank you!

JorTurFer commented 2 years ago

@mknet3 , ping me if you need help ;)

mknet3 commented 2 years ago

Just to confirm: I'm on it and I will help with this scaler.

tomkerkhove commented 2 years ago

Great, thanks!

mknet3 commented 2 years ago

Hi @tomkerkhove, I have had a look at this issue and I would like to clarify some things. AFAIK the goal of this issue is to provide a scaler based on metrics exported by an exporter configured in the collector. This exporter will expose metrics in a KEDA format to be read by the scaler. Quick question: does the exporter already exist or is there a plan to develop it? (I suppose it will be in opentelemetry-collector-contrib.) This question is to figure out what the format of the exposed data will be, in order to pull it in the scaler.

tomkerkhove commented 2 years ago

That would be part of the investigation, but I think we'll need to build our own exporter to get the metrics in, or use the gRPC OTEL exporter / HTTP OTEL exporter as a starting point to push them to KEDA.

I'd prefer the latter approach to get started, as we don't have a preference on the metric format, so OTEL is fine.

JorTurFer commented 2 years ago

@mknet3 prefers to keep it free for the moment because it's his first task with Golang

sushmithavangala commented 2 years ago

Working on this

tomkerkhove commented 2 years ago

Before we go all in, it might be good to post a proposal here @SushmithaVReddy to avoid having to redo things, but I think relying on the OTEL exporter is best

sushmithavangala commented 2 years ago

@tomkerkhove , sure. I'll put a proposal here before we start the implementation.

Quick doubt: is the idea here to scale based on the metrics obtained from the data type go.opentelemetry.io/otel/exporters/otlp/otlpmetrics?

sushmithavangala commented 2 years ago

@tomkerkhove Will KEDA be acting as a collector that gets metrics data from an exporter? Is the idea to create metrics using https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument and observe them through the HPA to scale accordingly? I'm slightly confused about the terms exporter and collector w.r.t. KEDA. A plausible solution looks like one where the user has an exporter that exports metrics, and KEDA connects to this exporter and gets the metrics (as a collector?) to make scaling decisions based on the metrics mentioned in the scaled object.

tomkerkhove commented 2 years ago

The idea is to use the OTEL exporter (https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/otlpexporter/README.md), from which KEDA fetches metrics to make scaling decisions.

This is similar to how we integrate with Prometheus, where we pull the metrics from Prometheus and move on; however, here it's in OTEL format, coming from the expected OTEL exporter that end-users have to add to their OTEL collector (so not up to KEDA).

From an end-user perspective, they should give us:

  1. URI of the OTEL endpoint to talk to on the collector (but they add the following to their collector: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/otlpexporter/README.md#getting-started)
  2. Optional parameter to use gRPC or HTTP (but we can just start with gRPC for now as well)

Hope that helps?
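As a sketch of step 1, the end-user's collector config could gain an `otlp` exporter entry along the lines of the README linked above. This is only an illustrative fragment: the `keda-otel.keda.svc` endpoint name is hypothetical, and the `receivers` referenced in the pipeline are assumed to be defined elsewhere in the config.

```yaml
# Collector-side config fragment (sketch): wire the stock otlp exporter
# into the metrics pipeline so metrics get pushed to a hypothetical
# OTLP endpoint exposed by KEDA.
exporters:
  otlp:
    endpoint: keda-otel.keda.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]   # assumes an otlp receiver is defined elsewhere
      exporters: [otlp]
```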

sushmithavangala commented 2 years ago

This helps, Tom. Thanks!

sushmithavangala commented 2 years ago

@tomkerkhove any thoughts on the scaled object here (ref below)? The idea is to use OTEL (https://pkg.go.dev/go.opentelemetry.io/otel), connect to the endpoint mentioned in the scaledobject, pull the metric value, and compare it to the threshold to scale.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        exporter: http://otel-collector:4317
      metrics:
        - metricName: http_requests_total
          threshold: '100'
      authenticationRef:
        name: authdata

I was also wondering about scenarios where users want to pull multiple metrics from their application and scale based on conditions on those metrics, e.g. as below:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        exporter: http://otel-collector:4317
      metrics:
        - metricName: http_requests_total
          threshold: '100'
          operator: greaterthan
        - metricName: http_timeouts
          threshold: '5'
          operator: lesserthan
      query: http_requests_total and http_timeouts
      authenticationRef:
        name: authdata

Any ideas on what the scope of the scaler we'll be building should be, in terms of multiple metrics?

tomkerkhove commented 2 years ago

It's ok for me to use that package since that's the official SDK - Thanks for checking.

I don't see the difference between both proposals other than one vs multiple metrics though? Can you elaborate on it?

In terms of supporting multiple metrics - I'd argue that given we support multiple triggers it might be more aligned with other scalers to only support 1 metric per trigger to keep a consistent approach in KEDA. The only consideration I would have here is performance but I think we can manage that in the implementation. Thoughts @zroubalik @JorTurFer?

Based on that we'll need to review the YAML spec, but in general I think it's ok; however, if we use multiple levels then I would use exporter.url instead of exporter, given we might need auth or similar settings in the future.

sushmithavangala commented 2 years ago

Yes @tomkerkhove, the proposals point out multiple-metrics usage as you understood. I agree with keeping consistency with the other scalers we have, but I'm concerned about how much value our scaler will add if it can only scale on a single metric, whereas OpenTelemetry is mostly used to emit a lot of metrics.

nitpick: If we have one metric per scaled object and a user wants to scale based on multiple metrics and goes ahead and creates that many scaled objects, I wonder how we handle concurrent scenarios where multiple metrics result in scaling (over-scaling? because the scaled-up instances could have been reused?)

tomkerkhove commented 2 years ago

It would be nice to also make the protocol configurable, given OTEL supports both HTTP and gRPC

tomkerkhove commented 2 years ago

Ha, but we have this covered already today.

Customers should only create 1 SO per scale target (for which we will provide validation soon). However, 1 SO can have 1 or more triggers and start scaling as soon as one of them meets the criteria. You can learn more about that in our concepts.

sushmithavangala commented 2 years ago

Right, does the below one make sense?

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        exporter:
          protocol: grpc
          url: http://otel-collector:4317
        metric:
          name: http_requests_total
          threshold: '100'
      authenticationRef:
        name: authdata
    - type: opentelemetry
      metadata:
        exporter:
          protocol: grpc
          url: http://otel-collector:4317
        metric:
          name: http_errors
          threshold: '10'
      authenticationRef:
        name: authdata

zroubalik commented 2 years ago

Yeah, this is correct; you can define multiple triggers per SO. Just one thing: is metric.name related to OTEL?

sushmithavangala commented 2 years ago

metric.name will be used to pull the metric. It should match the name users have in their instrumented application.

zroubalik commented 2 years ago

Okay, and let's make the trigger metadata flat, to be in sync with other scalers.

Something like this; feel free to rename/update the fields to follow OTEL conventions.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        protocol: grpc
        exporter: http://otel-collector:4317
        metric: http_requests_total
        threshold: '100'
      authenticationRef:
        name: authdata

tomkerkhove commented 2 years ago

Sounds good to me.

markallanson commented 2 years ago

Maybe I misunderstand something about this conversation, but given all pods in a ReplicaSet will contain an OTEL collector, which one would the KEDA autoscaler talk to in order to make its decisions?

Also, how would you apply aggregates across metric labels?

tomkerkhove commented 2 years ago

KEDA will not manage the OTEL collector and is something you'd need to run separately next to KEDA/in your cluster.

Does that clarify it?

markallanson commented 2 years ago

Sorry maybe my question was not clear enough.

If you have 10 pods, all of which have OTEL sidecars running, which will KEDA talk to? If just one, it won't have enough information to base scaling decisions on. If it talks to all of them, then how will it generate aggregates of the data across all of them?

tomkerkhove commented 1 year ago

There is no sidecar involved, there will be a separate deployment that KEDA integrates with through a Kubernetes service. End-users will have to bring their own OpenTelemetry Collector: https://opentelemetry.io/docs/collector/deployment/
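For illustration, the separate-deployment setup described above could be a standard collector Deployment fronted by a Kubernetes Service that KEDA addresses. All names here are hypothetical; this is just a sketch of the wiring, not KEDA's actual setup.

```yaml
# Hypothetical Service fronting a standalone OTel Collector Deployment.
# KEDA (or anything else in-cluster) would address it as
# otel-collector.monitoring.svc:4317.
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector    # assumes the Deployment's pods carry this label
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```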

tomkerkhove commented 1 year ago

Any update on this @SushmithaVReddy ?

tomkerkhove commented 1 year ago

The priorities of @SushmithaVReddy have changed and she no longer has time to complete the task, so I'm unassigning her.

raorugan commented 1 year ago

This scaler work is not active at the moment. We will revisit this in the future. In the meantime, if anyone is interested in picking this up, please notify here so we know that you are working on it. Thanks for your contributions!

fira42073 commented 1 year ago

I'm willing to help with this scaler. I will read these discussions a couple more times, summarize everything in one comment, and then start development. You can assign this issue to me if that's okay.

tomkerkhove commented 1 year ago

Awesome, thank you @Friedrich42! Just so you know, KEDA 2.9 will ship in less than a month so be aware of that if you want to have it in there but no pressure from our side as 2.10 is fine as well.

fira42073 commented 1 year ago

As far as I understood (please correct me if I'm wrong), the user will have to deploy an OTEL agent in the cluster and this scaler will be the consumer.

The config will look something like:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        protocol: grpc
        exporter: http://otel-collector:4317
        metric: http_requests_total
        threshold: '100'
      authenticationRef:
        name: authdata

Question: what will the authentication object look like for this scaler? Or are all auth objects the same?

Assumption: we will use only one metric from OTEL per scaler; correct me if I'm wrong.

JorTurFer commented 1 year ago

How are we going to aggregate OpenTelemetry metrics? I mean, we can't assume that users will use a single collector instance; in fact, we should assume that there are multiple instances, and AFAIK each instance could have different metrics because the apps don't push their metrics to all of them. Even worse, if users use the OpenTelemetry operator, the collector could be a sidecar container on each pod. Are we going to require another collector instance where all the metrics are pushed (e.g. using the otlpexporter from the other collectors)? Just to clarify, I'm not against this option.
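The "another collector on top" option mentioned here matches the common agent/gateway pattern: each agent collector forwards to one central gateway collector, and KEDA would only have to talk to the gateway. A sketch of the agent-side config, assuming a hypothetical otel-gateway Service, could look like:

```yaml
# Agent-side collector config (sketch): forward all metrics to a single
# central "gateway" collector, so a consumer like KEDA only needs one
# endpoint. The otel-gateway Service name is hypothetical.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: otel-gateway.monitoring.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```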

tomkerkhove commented 1 year ago

I think the above is correct, but I would rename exporter to uri.

How are we going to aggregate opentelemetry metrics? I mean, we can't assume that users will use a single collector instance, in fact, we should assume that there are multiple instances, and AFAIK, each instance could have different metrics because the apps don't push the metrics to all of them.

Metrics are pushed into the collector and we just consume them; we shouldn't have to worry about all of this, as this is an OTEL problem. https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/otlpexporter/README.md

JorTurFer commented 1 year ago

Are we going to store the metrics internally? I mean, that exporter pushes the metrics to another server; it's not a pulling endpoint AFAIK: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/otlpexporter/otlp.go#L103-L119.

On the other hand, as a user I'd not like to run the collector without HA just to use it with KEDA; observability is crucial and HA is required. So each instance of the collector could have different metrics, because each app pushes its metrics randomly to one collector instance, and if KEDA hits only one of them, we could miss some metrics. That's why I asked about aggregating, or having another collector on top of them only for KEDA. If I were a user, how would I configure this scaler when my collector deployment has >=2 instances?

tomkerkhove commented 1 year ago

It's push indeed so we'd just need an endpoint to store the latest metric in memory and that's fine; am I missing something?

Based on your remark there seems to be a serious concern, but I think I'm missing it.
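The "endpoint that stores the latest metric in memory" idea mentioned above can be sketched as a small thread-safe store: the push/receiver path overwrites the latest value per metric name, and the scaler path reads it back to compare against a threshold. This is an illustrative sketch only (the type and method names are made up, not KEDA's actual code), and the OTLP receiver wiring is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// latestMetric holds the most recent value pushed for a metric.
type latestMetric struct {
	value      float64
	receivedAt time.Time
}

// MetricStore keeps only the latest observed value per metric name,
// which is all a threshold-based scaler needs.
type MetricStore struct {
	mu      sync.RWMutex
	metrics map[string]latestMetric
}

func NewMetricStore() *MetricStore {
	return &MetricStore{metrics: make(map[string]latestMetric)}
}

// Record overwrites the stored value for name; this would be called
// from the push/receiver path for each incoming data point.
func (s *MetricStore) Record(name string, value float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[name] = latestMetric{value: value, receivedAt: time.Now()}
}

// Latest returns the last pushed value, if any; this would be called
// from the scaler's metric-fetching path.
func (s *MetricStore) Latest(name string) (float64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	m, ok := s.metrics[name]
	return m.value, ok
}

func main() {
	store := NewMetricStore()
	store.Record("http_requests_total", 120)
	if v, ok := store.Latest("http_requests_total"); ok {
		fmt.Println(v > 100) // scale out when above the threshold
	}
}
```

One design consequence of "latest value only": if the collector stops pushing, the stored value goes stale, so a real implementation would likely also check receivedAt against a staleness window.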

JorTurFer commented 1 year ago

Most probably I have missed some important point in the thread :/ Are the collector instances going to push metrics to KEDA, or is it KEDA that will connect to the collector?

Based on the trigger, I guess that KEDA will establish a gRPC connection to the collector instance (the exporter key) and then the collector will use that channel to push metrics to KEDA. What will happen if there are 2 or more collector instances? Will KEDA establish a connection with every collector instance? What about collector autoscaling?

TBH, my knowledge gap here could be the problem, because I'm a noob OTEL user and I may be overthinking things.

fira42073 commented 1 year ago

@JorTurFer I'm trying to find the answer to the same question about aggregation across multiple OTEL agents, but I have no answer yet. I probably need to take a look at how Jaeger or any other visualizer for OTEL did this.

tomkerkhove commented 1 year ago

We didn't spec this out to this level of detail yet, but in my opinion push is the best way to avoid polling constantly: just let them send the metrics to us.

What will happen if there are 2 or more collector instances? Will KEDA establish a connection with every collector instance? What about collector autoscaling?

That's a problem for the OpenTelemetry Collector, as this is just how their receivers work :shrug: That's something they have to fix if they offer push-based metrics.

I'm eager to know what @zroubalik thinks

fira42073 commented 1 year ago

Are there any new thoughts? @zroubalik?

zroubalik commented 1 year ago

I am probably not fully up to speed with all the conversation here, but how do other metrics-related tools (Jaeger, ...) handle this? I don't think we are the first ones to hit this issue. We should stick to what is common in this domain.

By any chance, can we use the push scaler for this scenario?

fira42073 commented 1 year ago

From what I see, Jaeger (https://github.com/jaegertracing/jaeger/blob/b92aa3edc5b682ef0afa89463b069630f3c273cc/cmd/collector/app/handler/otlp_receiver.go) uses the otlpreceiver: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver

26tanishabanik commented 1 year ago

@tomkerkhove, @JorTurFer, @Friedrich42, is the idea here to build a KEDA exporter? If yes, I have a few questions.

I have been researching a similar topic for a while now, while customising the OTEL Kafka receiver to solve a problem my organisation is working on.

After researching this repo thoroughly (https://github.com/open-telemetry/opentelemetry-collector-contrib), I have seen that whether it is a connector, exporter, processor, or receiver, each has its own set of configs. How will we deal with cases where we need metrics in a certain shape, maybe with a certain query on the receiver side, so that the KEDA exporter can get the metrics?

tomkerkhove commented 1 year ago

I think in our case we can just use the built-in OTEL gRPC/HTTP exporter; I don't personally see the need to create a dedicated exporter.

May I ask why you believe that would be an added value?

26tanishabanik commented 1 year ago

Sure, I thought that maybe attaching any data source receiver would be easier.

Can you kindly elaborate on how the OTEL gRPC/HTTP exporter can be integrated? And how will we handle the different data formats coming from the data source, say if someone wants to query a data source for scaling?

tomkerkhove commented 1 year ago

We haven't investigated this in much depth yet, but I'm happy to review proposals.

neelanjan00 commented 1 year ago

Pardon my ignorance, since I don't have extensive experience with OTEL collectors, but can you explain how we plan to provide the parameters required for defining the event pertaining to the respective data source? Will they somehow be defined in the collector config itself, or are we going to provide them from KEDA?

JorTurFer commented 12 months ago

Please ignore this; I deleted my own comment with the wrong browser 🤦

I'm not sure we can use OpenTelemetry as a scaler, because OpenTelemetry doesn't store the data; it's just a producer and communication protocol. I mean, OpenTelemetry defines how to generate and send the data, but it doesn't have any store where we can query the values. To achieve this, we (KEDA) would have to be a data store using OTLP (the OpenTelemetry Protocol) for receiving the telemetry information. You can't just ask the collector for the information, because the collector isn't a backend store; it's a "routing pipe".

I wouldn't like to receive all the telemetry in KEDA to scale based on it, because it'd be crazy and we would need to manage it securely, having access to ALL telemetry data on our side. I think that we can close this issue because it doesn't make sense (IMO). End users should use the proper backend storage scaler (Loki, Prometheus, Elastic, etc.) to scale based on those stores.