kyma-project / telemetry-manager

Manager for the Kyma telemetry module
https://kyma-project.io/#/telemetry-manager/user/README
Apache License 2.0

Autoscaling for trace and metric gateway #424

Open a-thaler opened 1 year ago

a-thaler commented 1 year ago

Description

Once https://github.com/kyma-project/telemetry-manager/issues/430 is solved, manual scaling of the gateway will be supported. However, manual scaling requires the user to inspect the gateway metrics, which demands advanced knowledge. Ideally, the gateway scales automatically depending on the load, which could also save resources.

In the simplest form, the gateway could be scaled by memory using an HPA. Only the proper tuning needs to be found, and the HPA would be managed by the operator. Typically, however, the collector should be scaled based on incoming requests. It should also not scale out because of backend problems such as backpressure, see https://opentelemetry.io/docs/collector/scaling/. So a better approach is to base the scaling on metrics. This could be done with Kubernetes mechanisms, using Prometheus and the prometheus-adapter or KEDA to feed the HPA controller with custom metrics. However, that would complicate the setup considerably.
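For reference, the HPA controller derives the replica count from the ratio between the current and the target metric value. The following Python sketch reproduces that documented formula; the function name, the default bounds, and the example numbers are illustrative assumptions, and details of the real controller such as stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count following the documented HPA formula:
    ceil(current_replicas * current_value / target_value).

    min_replicas, max_replicas and the example values are illustrative.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # The controller skips scaling when the ratio is close enough to 1.0
    # (the default tolerance is 0.1).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))
```

With a memory target, `current_value` rises both on higher ingestion and on a filling export queue, which is why a plain memory-based HPA cannot tell the two situations apart.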

Goal

Have autoscaling of the gateway in place so that users don't need to learn when to scale manually. The scaling should be kept simple but serve its purpose: scale out on increased ingestion, and scale less or not at all on problems with the backend.

Criteria

Reasons

It should not be the user's concern when to scale out.

Attachments

Ideas

memory: Scaling on memory may be possible by reserving a fixed portion of memory for backpressure scenarios, which is much smaller than the maximum memory. On backpressure, the gateway might scale up a bit but will not scale out. Scale-out will happen only on an increased ingestion rate.

metrics+sidecar: Run a Prometheus as a sidecar of the operator, scraping all telemetry components. The footprint will be very small because only a few endpoints and a few metrics are scraped. The operator can then query the Prometheus to make scaling decisions. The sidecar could also be used for health checks, to indicate problems to the user via the status, and to provide aggregated business metrics the user could rely on.
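A rough Python sketch of how such a decision could look, assuming the operator reads an ingestion rate and an export failure rate from the sidecar through the Prometheus instant-query HTTP API. The Prometheus address, the PromQL expressions, the metric names, the per-replica capacity, and the failure threshold are all assumptions for illustration and are not taken from the telemetry-manager.

```python
import math
from urllib.parse import urlencode

# Hypothetical sidecar address and queries; the metric names may differ
# depending on the collector version.
PROMETHEUS_URL = "http://localhost:9090"
INGESTION_QUERY = "sum(rate(otelcol_receiver_accepted_spans[5m]))"
FAILURE_QUERY = "sum(rate(otelcol_exporter_send_failed_spans[5m]))"


def query_url(base_url, promql):
    """Build the URL for the Prometheus instant-query HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def decide_replicas(current, ingested_per_sec, failed_per_sec,
                    per_replica_capacity, min_replicas=1, max_replicas=10,
                    max_failure_ratio=0.05):
    """Scale on ingestion, but hold the replica count on backend problems."""
    # A high export failure ratio points to backend trouble or backpressure;
    # more replicas would not help, so keep the current count.
    if ingested_per_sec > 0 and failed_per_sec / ingested_per_sec > max_failure_ratio:
        return current
    wanted = math.ceil(ingested_per_sec / per_replica_capacity)
    return max(min_replicas, min(max_replicas, wanted))
```

Keeping `decide_replicas` free of network calls makes the scaling rule easy to unit test; the operator would fetch the two values from `query_url(...)` and pass them in.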

metrics+adapter: Run a Prometheus sidecar as before, but also run a prometheus-adapter registered with the API server. The Prometheus should serve metrics only for the specific namespace/pods so that it does not clash with users' adapters. Then use a regular HPA for scaling.
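To illustrate the last step, here is a Python sketch that builds an `autoscaling/v2` HPA manifest targeting a per-pod custom metric. The Deployment name, namespace, metric name, and target value are hypothetical; the metric would have to be exposed by a matching prometheus-adapter rule, which is not shown.

```python
import json


def gateway_hpa(name, namespace, metric_name, target_average,
                min_replicas=1, max_replicas=5):
    """Return an autoscaling/v2 HPA manifest scaling on a per-pod metric."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                # "Pods" metrics are averaged across all pods of the target.
                "type": "Pods",
                "pods": {
                    "metric": {"name": metric_name},
                    "target": {
                        "type": "AverageValue",
                        "averageValue": target_average,
                    },
                },
            }],
        },
    }


if __name__ == "__main__":
    # All argument values below are placeholders, not actual module settings.
    manifest = gateway_hpa("trace-gateway", "example-namespace",
                           "accepted_spans_per_second", "1000")
    print(json.dumps(manifest, indent=2))
```

Because the HPA only sees the metric served by the adapter, choosing an ingestion-based metric here is what keeps backend problems from triggering a scale-out.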

kyma-bot commented 11 months ago

This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.

/lifecycle stale

a-thaler commented 5 months ago

The performance of the default setup is sufficient for all known use cases. Also, you can scale the gateways manually via the module configuration. With that, the feature is no longer that relevant, and we will postpone working on it until we face the first problems with the default setup.