robusta-dev / krr

Prometheus-based Kubernetes Resource Recommendations
MIT License
2.98k stars 155 forks

Support metrics-based workload discovery #59

Open lujiajing1126 opened 1 year ago

lujiajing1126 commented 1 year ago

Is your feature request related to a problem? Please describe.

In some cases, we want to run the KRR program locally. But for security reasons, the API server of the Kubernetes cluster cannot be accessed from outside the cluster.

So we can use Prometheus-based workload discovery instead, provided kube-state-metrics is installed.

Describe the solution you'd like

We can do the workload-based discovery with the following steps:

  1. List Deployments together with their ReplicaSets:

```python
replicasets = await self.metrics_loader.loader.query(
    "count by (namespace, owner_name, replicaset) (kube_replicaset_owner{"
    f'namespace=~"{ns}", '
    'owner_kind="Deployment"})'
)
```

  2. List Pods from the group of ReplicaSets:

```python
# owner_name matches the ReplicaSet names
pods = await self.metrics_loader.loader.query(
    "count by (owner_name, replicaset, pod) (kube_pod_owner{"
    f'namespace="{namespace}", '
    f'owner_name=~"{owner_name}", '
    'owner_kind="ReplicaSet"})'
)
```

  3. List containers from the Pods obtained in step (2):

```python
containers = await self.metrics_loader.loader.query(
    "count by (container) (kube_pod_container_info{"
    f'namespace="{namespace}", '
    f'pod=~"{pod_selector}"'
    "})"
)
```

  4. Build a K8sObjectData for each container obtained in step (3):

```python
async def __build_from_owner(
    self, namespace: str, app_name: str, containers: List[str], pod_names: List[str]
) -> List[K8sObjectData]:
    return [
        K8sObjectData(
            cluster=None,
            namespace=namespace,
            name=app_name,
            kind="Deployment",
            container=container_name,
            # resolve the container's resource allocations across the matched pods
            allocations=await self.__parse_allocation(namespace, "|".join(pod_names), container_name),
            # the pods backing this workload
            pods=[PodData(name=pod_name, deleted=False) for pod_name in pod_names],
        )
        for container_name in containers
    ]
```
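Steps (1)–(3) above can be condensed into pure query-building helpers, shown here as a minimal sketch. The metric names and label selectors mirror the snippets above; the function names and parameters are hypothetical, not from the krr codebase, and executing the queries against Prometheus is left to the caller's query client.

```python
# Sketch: build the kube-state-metrics PromQL queries used for
# metrics-based workload discovery. These helpers only construct
# query strings; they perform no network I/O.

def deployments_query(ns_pattern: str) -> str:
    """Step 1: ReplicaSets owned by Deployments in matching namespaces."""
    return (
        "count by (namespace, owner_name, replicaset) (kube_replicaset_owner{"
        f'namespace=~"{ns_pattern}", '
        'owner_kind="Deployment"})'
    )

def pods_query(namespace: str, owner_pattern: str) -> str:
    """Step 2: Pods owned by the given ReplicaSets."""
    return (
        "count by (owner_name, replicaset, pod) (kube_pod_owner{"
        f'namespace="{namespace}", '
        f'owner_name=~"{owner_pattern}", '
        'owner_kind="ReplicaSet"})'
    )

def containers_query(namespace: str, pod_selector: str) -> str:
    """Step 3: containers of the matched Pods."""
    return (
        "count by (container) (kube_pod_container_info{"
        f'namespace="{namespace}", '
        f'pod=~"{pod_selector}"'
        "})"
    )
```

Keeping the query construction separate from query execution also makes this discovery path easy to unit-test without a live Prometheus.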


danielhass commented 1 year ago

In our team we would love to see this option as well. We have played around with the newly introduced -l option, which is not yet released and only available on the main branch (introduced via #70). However, we noticed that workload discovery still seems to be done via the Kubernetes API in this case.

Our setup contains a central monitoring cluster where we would like to use krr to produce recommendations for the different clusters connected to this central monitoring solution. Metrics-based workload discovery would free us from requiring Kubernetes API access to every downstream cluster that reports metrics to the central solution. As shown above, the workload information should be extractable from the stored metrics with reasonable effort.

lujiajing1126 commented 1 year ago

> In our team we would love to see this option as well. We have played around with the newly introduced -l option, which is not yet released and only available on the main branch (introduced via #70). However, we noticed that workload discovery still seems to be done via the Kubernetes API in this case.
>
> Our setup contains a central monitoring cluster where we would like to use krr to produce recommendations for the different clusters connected to this central monitoring solution. Metrics-based workload discovery would free us from requiring Kubernetes API access to every downstream cluster that reports metrics to the central solution. As shown above, the workload information should be extractable from the stored metrics with reasonable effort.

You can probably try my fork branch, https://github.com/lujiajing1126/krr/tree/support-prom-discovery.

We've already run this branch several times to provide resource recommendations for our production cluster (you may need some tweaks to the formatter).

Just

$ python krr.py simple -n <namespace> -p <prometheus-url>

is enough.

danielhass commented 1 year ago

Hey @lujiajing1126, thank you for your work and for pointing us to your fork. The results so far look very promising. We were able to generate recommendations running against our central monitoring cluster without problems so far.

Are you planning to contribute your changes back to krr? If you need any input or further testing which we can help with feel free to ping me!

lujiajing1126 commented 1 year ago

> Hey @lujiajing1126, thank you for your work and for pointing us to your fork. The results so far look very promising. We were able to generate recommendations running against our central monitoring cluster without problems so far.
>
> Are you planning to contribute your changes back to krr? If you need any input or further testing which we can help with feel free to ping me!

Thanks for your feedback! Sure, I am glad to contribute this feature back to the upstream. Since the branch contains several patches besides the solution to this issue, I would like to split it into several pull requests.

But so far I have not received any code review or comments. It seems the maintainers are not actively considering contributions from the community.

aantn commented 1 year ago

@lujiajing1126 hey, we appreciate the contribution and will review soon.

I'm sorry about the delay. We're a small team, so it sometimes takes more time than we'd like.

LeaveMyYard commented 1 year ago

@lujiajing1126 Hey! I really like your idea and approach, and I am currently refactoring the way KRR works with Prometheus.

But there is one thing that came to my mind about removing the use of the cluster API: we have an auto-discovery feature that makes life easier for lots of people by saving them from searching for the link. But I think we can keep it; it just won't work when you don't have access, in which case you provide the URI yourself.

But there is one more thing I am thinking about: would it be possible to get data about CRDs using your approach? WDYT? (https://github.com/robusta-dev/krr/issues/65)

LeaveMyYard commented 1 year ago

> In our team we would love to see this option as well. We have played around with the newly introduced -l option, which is not yet released and only available on the main branch (introduced via #70). However, we noticed that workload discovery still seems to be done via the Kubernetes API in this case.
>
> Our setup contains a central monitoring cluster where we would like to use krr to produce recommendations for the different clusters connected to this central monitoring solution. Metrics-based workload discovery would free us from requiring Kubernetes API access to every downstream cluster that reports metrics to the central solution. As shown above, the workload information should be extractable from the stored metrics with reasonable effort.

@Avi-Robusta you are working with centralized monitoring, could you also join the discussion?

lujiajing1126 commented 1 year ago

> But there is one thing that came to my mind about removing the use of the cluster API: we have an auto-discovery feature that makes life easier for lots of people by saving them from searching for the link. But I think we can keep it; it just won't work when you don't have access, in which case you provide the URI yourself.

Yes, I think so. We can generally provide two ways of workload discovery:

  1. Use the Kubernetes API server if provided, and automatically discover an in-cluster Prometheus-compatible metrics server,
  2. Fall back to fully Prometheus-based discovery if only --prometheus-url is given.
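That fallback logic can be sketched in a few lines. This is an illustration only, assuming the two inputs are a flag for whether a kubeconfig/API server is reachable and the optional --prometheus-url value; the function and parameter names are hypothetical, not from the krr codebase.

```python
from typing import Optional

def choose_discovery_mode(api_server_available: bool, prometheus_url: Optional[str]) -> str:
    """Pick the workload-discovery mode from what the user supplied."""
    if api_server_available:
        # API-server discovery, with in-cluster auto-discovery of a
        # Prometheus-compatible metrics server.
        return "api-server"
    if prometheus_url:
        # No cluster access: discover workloads purely from metrics.
        return "prometheus"
    raise ValueError("need either Kubernetes API access or --prometheus-url")
```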

> But there is one more thing I am thinking about: would it be possible to get data about CRDs using your approach? WDYT? (#65)

I am not familiar with ArgoCD, so we need some investigation. But I think the pattern is prevalent: advanced Kubernetes workloads need their own controllers, e.g. OpenKruise.

And I would like to contribute my branch back to the upstream. (I probably need some more work to polish the code; I expect to finish next week.)

Again, thanks to your team for the hard work in bringing us this great product!

LeaveMyYard commented 1 year ago

> Yes, I think so. We can generally provide two ways of workload discovery:
>
>   1. Use the Kubernetes API server if provided, and automatically discover an in-cluster Prometheus-compatible metrics server,
>   2. Fall back to fully Prometheus-based discovery if only --prometheus-url is given.

We agree that making it an optional behaviour is a good choice, perhaps by adding a --discovery-method flag with "api-server" as the default and "prometheus" as an option. With that approach we will not have to remove the current implementation, which works fine for almost everyone.
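Such a flag could look roughly like the following argparse sketch. This is for illustration only; krr's actual CLI plumbing may differ, and only the flag name and the two values come from the suggestion above.

```python
import argparse

# Illustration of the proposed flag: "api-server" stays the default so
# the existing behaviour is unchanged, while "prometheus" opts in to
# metrics-based workload discovery.
parser = argparse.ArgumentParser(prog="krr")
parser.add_argument(
    "--discovery-method",
    choices=["api-server", "prometheus"],
    default="api-server",
    help="How to discover workloads: via the Kubernetes API server "
         "(default) or purely from Prometheus metrics.",
)

args = parser.parse_args(["--discovery-method", "prometheus"])
```

Using `choices` means an unknown value fails fast at argument parsing rather than deep inside the discovery code.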

aantn commented 5 months ago

Hey all, just an update: we have a PR with a new mode where KRR can operate based only on Prometheus data, so this will be supported in the near future. Please help by testing the PR!

https://github.com/robusta-dev/krr/pull/266