[Feature] Cluster Controller support on secondaries

passionInfinite commented 8 months ago

Kubecost Helm Chart Version

v2.0.2

Kubernetes Version

v1.27.7

Kubernetes Platform

AKS

Description

First Approach:

Set federatedETL.agentOnly: true and clusterController.enabled: true. The cluster controller's CC_CCL_COST_MODEL_PATH and CC_KUBESCALER_COST_MODEL_PATH environment variables then point to the default port (9090) and the /model path.

Second Approach: Set federatedETL.agentOnly: true and clusterController.enabled: true, plus service.port: 9003 and service.targetPort: 9003. The cluster controller's CC_CCL_COST_MODEL_PATH and CC_KUBESCALER_COST_MODEL_PATH environment variables then point to port 9003 but still use the /model path, which is not available because the request does not go through the nginx proxy.
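For reference, the two configurations above could be written as Helm values roughly as follows. This is only a sketch: the keys and values are the ones named in this description, the nesting is assumed from their dotted paths, and nothing else is set. Each YAML document stands for a separate values file.

```yaml
# First approach: agent-only mode with Cluster Controller enabled.
# (Nesting assumed from the dotted key names in the description.)
federatedETL:
  agentOnly: true
clusterController:
  enabled: true
---
# Second approach: same as above, plus the service port overrides.
federatedETL:
  agentOnly: true
clusterController:
  enabled: true
service:
  port: 9003
  targetPort: 9003
```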

Both approaches fail with the following messages:

Kubescaler setup failed error="creating a Kubescaler: recommendation service unavailable: unavailable because status (404) is invalid"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x224665e]

goroutine 1 [running]:
main.main()
    /app/cmd/clustercontroller/main.go:237 +0x53e

Steps to reproduce

Apply either configuration from the description; the failure should reproduce with both.

Expected behavior

Either those two variables need to be configurable through values, or some other approach recommended by the Kubecost team should allow Cluster Controller to run in agentOnly mode.

Impact

We can't run Cluster Controller in agentOnly mode on Federated ETL clusters.

Screenshots

No response

Logs

2024-02-21T14:33:21Z INF Determined to be running in a cluster. Using in-cluster K8s config.
2024-02-21T14:37:31Z ERR Kubescaler setup failed error="creating a Kubescaler: recommendation service unavailable: unavailable because status (404) is invalid"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x224665e]

goroutine 1 [running]:
main.main()
    /app/cmd/clustercontroller/main.go:237 +0x53e

Slack discussion

No response

Troubleshooting

[X] I have read and followed the issue guidelines and this is a bug impacting only the Helm chart.
[X] I have searched other issues in this repository and mine is not recorded.

passionInfinite commented 8 months ago

Happy to contribute a fix for this bug, but I need help deciding which approach to take.

AjayTripathy commented 8 months ago

cc @michaelmdresser want to weigh in? I lean towards the first approach given the relative simplicity (setting less things is good) but would defer to you on what is easiest.

michaelmdresser commented 8 months ago

Does agentOnly = true imply that the Aggregator container is not running? (I don't remember).

If so, this isn't a supported configuration method at the moment. Cluster Controller relies on Kubecost data APIs to make decisions; that means it needs the thing which provides those APIs (Aggregator) to be available.

In principle, I think a deployment that includes Cluster Controller is almost by definition not an "agent only" deployment.

passionInfinite commented 8 months ago

@michaelmdresser @AjayTripathy My thought process was a little different. agentOnly mode runs only the cost-model, which generates all the ETLs related to the usage metrics. Also, it exposes the model endpoint that can be used by the cluster-controller to perform the automated savings. We really want to disable the frontend, as teams often get confused between the Federated UI and the secondaries' frontend. Thoughts?

michaelmdresser commented 8 months ago

As mentioned in https://github.com/kubecost/cost-analyzer-helm-chart/issues/3172#issuecomment-1965251720, I think that we may not want the FE to exist at all when Aggregator is disabled.

> Also, it exposes the model endpoint that can be used by the cluster-controller to perform the automated savings

This is unfortunately not true, even though it seems intuitive. The /model API prefix exists as part of a legacy compatibility approach. I really do not recommend having Cluster Controller attempt to target the cost-model container's APIs in Kubecost v2.0.0+. If you want to use automated savings via Cluster Controller on a secondary cluster, I believe the only supported method is to have Aggregator enabled.

With that said, it is still certainly reasonable to request an ability to keep the backend running (for Cluster Controller support) while disabling the frontend. Would that help you @passionInfinite?

@kwombach12 for tracking

passionInfinite commented 8 months ago

@michaelmdresser What will be the side effects of running the aggregator (assuming that it will be the backend for cluster controller) on secondaries?

michaelmdresser commented 8 months ago

The only side effect should be the resource consumption of the Aggregator container.

A concern I have with this approach is that the resource consumption of Aggregator may be as high as the primary because the software tries to be "smart" about picking the data store to build from -- it may be the case that the secondary Aggregator will build all data, not just the data for its local cluster. This is a gap in my understanding; it is possible someone else has tested this idea.

passionInfinite commented 8 months ago

@michaelmdresser kubecost/cost-analyzer-helm-chart#3184 makes it possible to skip the Aggregator while still running the frontend with Cluster Controller. That should keep the impact on users low while still letting them migrate to v2.x.x.

This issue is more about agentOnly support, though. Happy to contribute here as well!

passionInfinite commented 8 months ago

@michaelmdresser Do we have any update on this one? We now have a fix for running in agent-only mode. The only remaining part is how Cluster Controller can work in agent-only mode.

AjayTripathy commented 8 months ago

Hi @passionInfinite, we're working on it. There are some security implications to the agent reaching cross-cluster to receive data to make changes from within a cluster. Could you help me understand the priority here? My understanding is that you can run with more than just the agent and use Cluster Controller for now, though it is a bit heavier to do so.

passionInfinite commented 8 months ago

Yes, I think we can move forward with the frontend enabled for now, but that option does not help us get teams onboarded to the Federated dashboard. Agent-only mode would help us both in terms of resources and in getting people to look at the Federated Kubecost dashboard rather than the secondary cluster dashboard.

passionInfinite commented 8 months ago

@AjayTripathy Will the Aggregator be required on secondaries? My thinking was that the cost-model is responsible for uploading the ETLs to storage, so secondaries only need the cost-model to ship those ETLs, and the Aggregator running on the primary reads them. Is that understanding correct?

passionInfinite commented 7 months ago

Any info on the above one?

AjayTripathy commented 7 months ago

Sorry for the late response. Since we currently need to serve queries on the secondaries for Cluster Controller, the Aggregator needs to run on the secondaries.

passionInfinite commented 7 months ago

@AjayTripathy Can we bump up the priority for this one? We have hundreds of secondaries, so running the Aggregator on each of them might not be a good route for us. Either we need a workaround or solution that supports kubescaler (cluster-controller) without the Aggregator, or the Aggregator needs a way to not compute all clusters and only care about its own secondary cluster (meaning it acts differently from the primary's Aggregator). We are stuck on this right now because we don't want to roll this out across that many secondary clusters.

teevans commented 7 months ago

Hey @passionInfinite, just to level-set expectations: this isn't a trivial addition to the cluster controller component, given the large risk area. This is something that I'm certain our product team would love to partner with you on, but it could be a several-month endeavor, as opening up the ability for cross-cluster communication can cause some major security vulnerabilities. CC @kwombach12 / @chipzoller

passionInfinite commented 6 months ago

@teevans Why do we require cross-cluster communication? Can't we run the autoscaler on the secondaries, since they will also have the ETLs needed to serve the savings metrics?

CC: @chipzoller / @michaelmdresser

teevans commented 6 months ago

@passionInfinite - They have the ETL files, but they wouldn't serve the data the same way. In theory we could build it that way, but that would require running the Aggregator on each secondary to serve the data, which wouldn't be resource efficient at all.

chipzoller commented 6 months ago

Since this appears to ultimately boil down to a feature request, I've transferred it to features-bugs, renamed it, and labeled it.

passionInfinite commented 5 months ago

Can we simply point Cluster Controller at the federated Kubecost endpoint to fetch savings?

I believe there was a variable that supports this config change?

chipzoller commented 3 weeks ago

Hello, in an effort to consolidate our bug and feature request tracking, we are deprecating using GitHub to track tickets. If this issue is still outstanding and you have not done so already, please raise a request at https://support.kubecost.com/.
