Reaching timeout when getting KubeObject on big cluster

Calvinaud commented 5 months ago

What happened: Context: infrastructure installed at cluster level, on a cluster with a large number of namespaces/objects.

When trying to create a ChaosExperiment in the UI, we cannot select any App namespace after selecting the App Kind. This is due to a query timeout, just like in https://github.com/litmuschaos/litmus/issues/4308. In our case, the timeout we are hitting is the one from the browser itself (1 min), since the getKubeObject query takes too long.

The query takes too long because our infrastructure is slow to answer: it takes around 2-3 minutes for it to send back the KubeObject list. This is probably because we have a big cluster and the getKubeObject function is not efficient enough. (For information, this is not due to a lack of resources from requests/limits.)

What you expected to happen: The query does not time out and we can select the namespace.

Where can this issue be corrected? (optional) I think the main way to solve this issue is to make getKubeObject/GetKubernetesObjects (https://github.com/litmuschaos/litmus/blob/master/chaoscenter/subscriber/pkg/k8s/objects.go#L27) more efficient/faster.

I think multiple solutions are possible (so don't hesitate to suggest a better one). The first possible solution is to add some parallelism at https://github.com/litmuschaos/litmus/blob/master/chaoscenter/subscriber/pkg/k8s/objects.go#L65. I don't think this is the best solution, since it creates a risk of DoSing the API server with a lot of requests.
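To illustrate the first option while limiting the DoS risk, here is a minimal Go sketch of bounded parallelism. It is not the Litmus implementation: `listFunc`, `listAllBounded` and `maxInFlight` are hypothetical names, and the per-namespace call is reduced to a function returning object names (a real version would also take a `context.Context`).

```go
package main

import "sync"

// listFunc is a hypothetical stand-in for the per-namespace list call made by
// the subscriber. It is not the actual Litmus or client-go signature.
type listFunc func(namespace string) ([]string, error)

// listAllBounded calls list once per namespace, with at most maxInFlight calls
// running at the same time, so the API server is not flooded with requests.
// It returns the results keyed by namespace, or the first error observed.
func listAllBounded(namespaces []string, maxInFlight int, list listFunc) (map[string][]string, error) {
	if maxInFlight < 1 {
		maxInFlight = 1
	}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		results  = make(map[string][]string, len(namespaces))
		slots    = make(chan struct{}, maxInFlight) // counting semaphore
	)
	for _, ns := range namespaces {
		slots <- struct{}{} // blocks while maxInFlight calls are in flight
		wg.Add(1)
		go func(ns string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot when this call ends
			objs, err := list(ns)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			results[ns] = objs
		}(ns)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return results, nil
}
```

With `maxInFlight` set to 1 this behaves like a sequential loop; raising it trades API server load for latency, so the value would need to be tuned (or made configurable) per cluster.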

The second solution, which seems harder to put in place, is to not retrieve all the objects of a type in every namespace at once, but to split the work into two queries:

the first retrieves only the namespace list when an object type is selected;
then, when the user selects a namespace, the second retrieves the list of objects/labels only for the selected namespace.
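The two-query split can be sketched in Go as below. This is only an illustration under assumptions: `fakeCluster`, `KubeObject`, `Namespaces` and `Objects` are hypothetical names, and an in-memory map stands in for the API server. The point is that the first step never lists workload objects, and the second step lists them for a single namespace only.

```go
package main

import (
	"errors"
	"sort"
)

// KubeObject is a simplified, hypothetical record of one workload.
type KubeObject struct {
	Name   string
	Labels map[string]string
}

// fakeCluster is an in-memory stand-in for the API server. Objects are keyed
// by kind, then by namespace. listCalls counts per-namespace list requests.
type fakeCluster struct {
	namespaces []string
	objects    map[string]map[string][]KubeObject
	listCalls  int
}

// Namespaces is query 1: it returns the sorted namespace names only and does
// not list any workload objects.
func (c *fakeCluster) Namespaces() []string {
	out := append([]string(nil), c.namespaces...)
	sort.Strings(out)
	return out
}

// Objects is query 2: it returns the objects of one kind in one namespace.
// An empty namespace is rejected so a caller cannot fall back to listing
// the whole cluster by accident.
func (c *fakeCluster) Objects(kind, namespace string) ([]KubeObject, error) {
	if namespace == "" {
		return nil, errors.New("namespace is required")
	}
	c.listCalls++
	return c.objects[kind][namespace], nil
}
```

In the UI flow, `Namespaces` would back the namespace dropdown once a kind is chosen, and `Objects` would run only after the user picks a namespace, so the cost of each request no longer grows with the total number of namespaces.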

How to reproduce it (as minimally and precisely as possible):

Create a cluster with a large number of namespaces/objects.
Install chaoscenter/infrastructure at cluster level.
Try to create a ChaosExperiment.
(To reproduce more easily, you can add some delay to the responses from the API server.)

Anything else we need to know?: Two questions related to this topic:

Shouldn't the order of the selection fields be changed to namespace first, then object type, then object name? (I don't think it's really necessary; the only gain is that you could show only the object types present in the namespace.)
Shouldn't a "free text" option be available in the UI, to be able to create a ChaosExperiment that targets an object that is not deployed? (I know it's still possible to do it in the YAML, but that is not really user friendly.)

smitthakkar96 commented 5 months ago

It would be nice to create a metadata-only informer to speed up lookups for labels, or to use a cached client in the subscriber. This should speed up the lookups.

https://firehydrant.com/blog/dynamic-kubernetes-informers/ https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
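As a rough illustration of why serving lookups from memory helps, here is a small read-through cache in Go. It is deliberately not an informer and does not use client-go: an informer keeps its cache current through watch events, whereas this sketch simply expires entries after a TTL. `objectCache`, `loadFunc` and `defaultTTL` are hypothetical names, and the 30-second TTL is an arbitrary assumption.

```go
package main

import (
	"sync"
	"time"
)

// defaultTTL is an arbitrary value chosen for this sketch.
const defaultTTL = 30 * time.Second

// loadFunc is a hypothetical stand-in for the slow cluster-wide lookup.
type loadFunc func(kind string) ([]string, error)

type entry struct {
	names   []string
	expires time.Time
}

// objectCache remembers the result of load per kind for ttl.
type objectCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	load    loadFunc
	entries map[string]entry
}

func newObjectCache(ttl time.Duration, load loadFunc) *objectCache {
	return &objectCache{ttl: ttl, load: load, entries: make(map[string]entry)}
}

// Get returns the cached names for kind and calls load only on a miss or
// after the entry has expired. Errors are not cached. The lock is held
// during load, which keeps the sketch simple but serializes slow loads.
func (c *objectCache) Get(kind string) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[kind]; ok && time.Now().Before(e.expires) {
		return e.names, nil
	}
	names, err := c.load(kind)
	if err != nil {
		return nil, err
	}
	c.entries[kind] = entry{names: names, expires: time.Now().Add(c.ttl)}
	return names, nil
}

// Invalidate drops the entry for kind so the next Get reloads it.
func (c *objectCache) Invalidate(kind string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, kind)
}
```

The trade-off is staleness: with a TTL the UI may show objects that were deleted up to `ttl` ago, which is the gap a watch-based informer is meant to close.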

smitthakkar96 commented 5 months ago

On a side note, to get better visibility into performance and stability issues, it would be great to kick off efforts on instrumentation using https://opentelemetry.io/ (logs, metrics, traces and profiles). This would allow users to monitor Litmus components on their own o11y stack, gain insights into scaling and stability challenges, and report/contribute with more data.

litmuschaos / litmus

Reaching timeout when getting KubeObject on big cluster #4616