Autoscaling for trace and metric gateway

a-thaler commented 1 year ago

Description After solving https://github.com/kyma-project/telemetry-manager/issues/430, manual scaling of the gateway will be supported. However, that requires the user to inspect the gateway metrics, which demands advanced knowledge. In the best case, the gateway scales automatically depending on the load, which potentially saves resources as well.

In its simplest form, the gateway could be scaled by memory using an HPA. Here, only the proper tuning needs to be found, and the HPA will be managed by the operator. However, the collector should typically be scaled based on incoming requests. Also, it should not be scaled on problems with the backend such as backpressure, see https://opentelemetry.io/docs/collector/scaling/. So a better approach is to manage the scaling based on metrics. This could be possible with k8s mechanisms, using Prometheus and the prometheus-adapter or KEDA to feed the HPA controller with custom metrics. However, that will complicate the setup a lot.

Goal Have autoscaling of the gateway in place so that the user doesn't need to learn when to scale manually. The scaling should be kept simple but serve the purpose: scale out on increased ingestion, but scale less or not at all on problems with the backend.

Criteria

On increased ingestion, the gateway starts scaling out so that no data loss happens (or only a little at the scaling decision point)
If the backend exerts backpressure, the gateway should not scale out to the maximum so that the backend can recover
Scaling history can be observed by operations (by having metrics in place)
The user can limit the maximum number of replicas to safeguard costs; a default should be in place

Reasons It should not be the user's concern when to scale out

Attachments

Ideas memory: Scaling on memory can be possible by reserving a fixed portion of memory for backpressure scenarios, which is far smaller than the maximum memory. On backpressure, the gateway might scale up a bit but not scale out fully; that will happen only on an increased ingestion rate.

metrics+sidecar: Run a Prometheus as a sidecar of the operator that scrapes all telemetry components. The footprint will be very small, as only a few endpoints and a few metrics will be scraped. The operator can then query the Prometheus to make scaling decisions. The sidecar could also be used for health checks to indicate problems to the user via the status, and to provide aggregated business metrics the user could rely on.

metrics+adapter: Run a Prometheus sidecar as before, but also run a prometheus-adapter registered on the apiserver. The Prometheus should serve only metrics for the specific namespace/pods so that it does not clash with users' adapters. Then use a regular HPA for scaling.

a-thaler commented 5 months ago

The performance of the default setup is sufficient for all known use cases. Also, you can scale the gateways manually via the module configuration. With that, the relevance of the feature is not that high anymore, and we will postpone working on it until we face the first problems with the default setup.

kyma-project / telemetry-manager

Autoscaling for trace and metric gateway #424