litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Reaching timeout when getting KubeObject on big cluster #4616

Closed Calvinaud closed 1 month ago

Calvinaud commented 5 months ago

What happened: Context: infrastructure installed at cluster level, on a cluster with a large number of namespaces and objects.

When trying to create a ChaosExperiment in the UI, we cannot select any App namespace after selecting the App Kind. This is due to a query timeout, just like in https://github.com/litmuschaos/litmus/issues/4308. In our case, the timeout we are hitting is the browser's own (1 min), because the getKubeObject query takes too long.

The query takes too long because our infrastructure is slow to answer: it takes around 2-3 for our infrastructure to send back the KubeObject list. This is probably because we have a big cluster and the getKubeObject function is not efficient enough. (For information, this is not due to a lack of resources from requests/limits.)

What you expected to happen: The query doesn't time out and we can select the namespace.

Where can this issue be corrected? (optional) I think the main way to solve this issue is to make getKubeObject/GetKubernetesObjects https://github.com/litmuschaos/litmus/blob/master/chaoscenter/subscriber/pkg/k8s/objects.go#L27 more efficient.

I think multiple solutions are possible (so don't hesitate to suggest a better one). The first possible solution is to add some parallelism at https://github.com/litmuschaos/litmus/blob/master/chaoscenter/subscriber/pkg/k8s/objects.go#L65. I don't think this is the best solution, since it creates a risk of DoSing the API server with a lot of requests.

The second solution, which seems harder to put in place, is to not retrieve all the objects of a type in every namespace at once, but to split the retrieval into two queries:
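To make the parallelism option and its risk concrete, here is a minimal sketch of bounded concurrency. It assumes the slow part is a loop that issues one list call per namespace; `listFunc` and `listAcrossNamespaces` are hypothetical names, not functions from the Litmus code base, and the fake lister in `main` stands in for a real API call. The `limit` parameter caps the number of in-flight requests, which is one way to reduce (not eliminate) the pressure on the API server.

```go
package main

import (
	"fmt"
	"sync"
)

// listFunc is a hypothetical stand-in for one per-namespace list call.
type listFunc func(namespace string) ([]string, error)

// listAcrossNamespaces calls list once per namespace, with at most
// limit calls in flight at any time. It returns the results keyed by
// namespace and the first error encountered, if any.
func listAcrossNamespaces(namespaces []string, limit int, list listFunc) (map[string][]string, error) {
	if limit < 1 {
		limit = 1
	}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		results  = make(map[string][]string, len(namespaces))
		sem      = make(chan struct{}, limit) // counting semaphore
	)
	for _, ns := range namespaces {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(ns string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			objs, err := list(ns)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("listing %q: %w", ns, err)
				}
				return
			}
			results[ns] = objs
		}(ns)
	}
	wg.Wait()
	return results, firstErr
}

func main() {
	// Fake lister used only for illustration.
	fake := func(ns string) ([]string, error) {
		return []string{ns + "/deployment-a"}, nil
	}
	res, err := listAcrossNamespaces([]string{"default", "kube-system", "litmus"}, 2, fake)
	fmt.Println(len(res), err)
}
```

A small `limit` keeps the load on the API server predictable, at the cost of a smaller speed-up; the right value would have to be measured on a real cluster.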

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: Two questions related to this topic:

smitthakkar96 commented 5 months ago

It would be nice to create a metadata-only informer to speed up lookups for labels, or to use a cached client in the subscriber. Either should speed up the lookups.
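As a rough illustration of why a cache helps here, the sketch below puts a small TTL cache in front of a slow list call, using only the Go standard library. This is not the informer or cached client mentioned above (those keep their cache fresh through watches instead of a TTL); `cachedLister`, `newCachedLister`, and the fake fetch function are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheEntry holds one cached list result and when it was fetched.
type cacheEntry struct {
	objects []string
	fetched time.Time
}

// cachedLister wraps a slow fetch function and serves repeated
// requests for the same kind from memory until the TTL expires.
type cachedLister struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   func(kind string) ([]string, error)
	entries map[string]cacheEntry
}

func newCachedLister(ttl time.Duration, fetch func(kind string) ([]string, error)) *cachedLister {
	return &cachedLister{ttl: ttl, fetch: fetch, entries: make(map[string]cacheEntry)}
}

// List returns the cached objects for kind if they are still fresh,
// and otherwise calls fetch and stores the result. The lock is held
// during fetch, so concurrent callers are serialized; that keeps the
// sketch simple and avoids duplicate fetches for the same kind.
func (c *cachedLister) List(kind string) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[kind]; ok && time.Since(e.fetched) < c.ttl {
		return e.objects, nil
	}
	objs, err := c.fetch(kind)
	if err != nil {
		return nil, err
	}
	c.entries[kind] = cacheEntry{objects: objs, fetched: time.Now()}
	return objs, nil
}

func main() {
	calls := 0
	// Fake slow fetch used only for illustration.
	lister := newCachedLister(time.Minute, func(kind string) ([]string, error) {
		calls++
		return []string{kind + "/example"}, nil
	})
	lister.List("deployment")
	lister.List("deployment") // served from the cache
	fmt.Println("fetch calls:", calls)
}
```

With a cache of this kind, only the first request for a given kind pays the full listing cost; the trade-off is that results can be stale for up to the TTL, which is the problem watch-based informers are designed to avoid.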

https://firehydrant.com/blog/dynamic-kubernetes-informers/ https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332

smitthakkar96 commented 5 months ago

On a side note, to get better visibility into performance and stability issues, it would be great to kick off efforts on instrumentation using https://opentelemetry.io/ (logs, metrics, traces and profiles). This would allow users to monitor Litmus components on their own O11y stack, gain insights into scaling and stability challenges, and report or contribute with more data.