[BUG] Large Clusters - Harvester UI Slow Performance and Monitoring Stack Insufficient Resource Defaults

harvester / harvester

Open source hyperconverged infrastructure (HCI) software

https://harvesterhci.io/

Apache License 2.0

[BUG] Large Clusters - Harvester UI Slow Performance and Monitoring Stack Insufficient Resource Defaults #5770

Open hoo29 opened 2 months ago

hoo29 commented 2 months ago

Describe the bug

On clusters with 8+ physical nodes, 40+ VMs, and 80+ volumes, the UI tabs for Dashboard, Hosts, and Volumes are very slow, and the monitoring stack (specifically Prometheus) is OOM-killed under the default pod limits.

To Reproduce

Steps to reproduce the behavior:

1. Create a large cluster with numerous VMs and volumes.
2. Try to use the UI (scrolling in all the tabs).
3. Deploy the monitoring stack.

Expected behavior

The monitoring stack default values work for larger clusters. The UI is responsive.

Support bundle

Can attach if required.

Environment

Harvester ISO version: 1.3.0
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Baremetal

Additional context

I think there is a ticket for paginated UI improvements, but I cannot find it now.

hoo29 commented 2 months ago

I also tried browsing via a Rancher deployment, but it suffers from the same issues.

w13915984028 commented 2 months ago

Please follow https://docs.harvesterhci.io/v1.3/monitoring/harvester-monitoring#from-ui to set higher memory and CPU limits for the related pods.

Prometheus uses more resources in a large cluster. You can set a limit, observe the result, and adjust again until you find suitable resource limits for your cluster.

hoo29 commented 2 months ago

A comment in Slack says they were getting OOM-killed with 7 nodes, which does not seem like a particularly large cluster. Is it possible to get the defaults increased to provide a better out-of-the-box experience for more users?

w13915984028 commented 2 months ago

It also depends on how many VMs are deployed. More VMs mean more pods, so Prometheus will collect, store, and analyze more data.

For the default settings, Harvester tends to keep relatively low values to make sure that even single-node and 3-node clusters can run those features.

A one-size-fits-all default value is not easy to define.

ibrokethecloud commented 2 months ago

@hoo29 any chance we could please have a support bundle from this cluster?

hoo29 commented 2 months ago

@w13915984028 - as a middle ground, could the pod requests be kept the same so they remain schedulable on smaller clusters, but the limits increased? If a cluster has enough resources deployed (VMs, volumes, etc.) to require more monitoring pod resources, it seems likely the underlying nodes will be large enough to support this. Customers can still fine-tune, but it is more likely to work out of the box.

@ibrokethecloud - emailed to harvester-support-bundle@suse.com. Please can I ask that metadata such as URLs and names of resources are not posted publicly.
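The requests-versus-limits suggestion above rests on general Kubernetes behavior: the scheduler places a pod based on its requests, while the limit only caps what the container may use at runtime. A minimal sketch of that reasoning follows. The class, the function names, and all memory figures are hypothetical illustrations, not Harvester's actual defaults.

```python
from dataclasses import dataclass


@dataclass
class PodResources:
    """Memory settings for one pod, in MiB (hypothetical values only)."""
    request_mib: int
    limit_mib: int


def fits_on_node(pod: PodResources, allocatable_mib: int, already_requested_mib: int) -> bool:
    # Scheduling considers requests only; the limit plays no part here.
    return already_requested_mib + pod.request_mib <= allocatable_mib


def is_oom_killed(pod: PodResources, usage_mib: int) -> bool:
    # At runtime, exceeding the memory limit gets the container OOM-killed.
    return usage_mib > pod.limit_mib


# Same request, different limits (illustrative numbers, not real defaults).
low_limit = PodResources(request_mib=512, limit_mib=2048)
raised_limit = PodResources(request_mib=512, limit_mib=8192)

# Both are equally schedulable on a small node with 1 GiB left to request.
print(fits_on_node(low_limit, 4096, 3072), fits_on_node(raised_limit, 4096, 3072))

# Only the low limit is OOM-killed when usage reaches 3 GiB.
print(is_oom_killed(low_limit, 3072), is_oom_killed(raised_limit, 3072))
```

The trade-off is that a higher limit with an unchanged request lets the pod overcommit a small node, so this only helps where the node really has the spare memory.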

bathomas commented 2 months ago

We have a cluster of 46 nodes and also saw slowdowns on both the Hosts and Dashboard pages. Monitoring was OOM-killed until we increased the limits on Prometheus to 8GB. I feel it would be beneficial to paginate Hosts.

We are also currently trying to use the vGPU capabilities (NVIDIA A100s) and had to increase both CPU and memory on the harvester-pcidevices-controller.

w13915984028 commented 2 months ago

@torchiaf As said above, when a cluster has 46 nodes, the Hosts page is slow. Could we paginate hosts?

torchiaf commented 2 months ago

The UI can handle pagination for Kubernetes resources. I'm not sure if the node API supports it. @bk201 Do you know more about it?

bk201 commented 1 month ago

@torchiaf, @Yu-Jack helped test, and here are the findings:

This is an example request with a limit: https://192.168.1.122/k8s/clusters/c-m-87f5m4xz/v1/harvester/devices.harvesterhci.io.pcidevices?limit=1

Then use the continue field of the response as the next request’s query string, like: https://192.168.1.122/k8s/clusters/c-m-87f5m4xz/v1/harvester/devices.harvesterhci.io.pcidevices?limit=1&continue=eyJyIjoiMTYzNDUwNjMiLCJjIjoiZXlKMklqb2liV1YwWVM1ck9ITXVhVzh2ZGpFaUxDSnlkaUk2TVRZek5EVXdOak1zSW5OMFlYSjBJam9pYW1GamEyNXZaR1V0TURBd01EQXdNREF3WEhVd01EQXdJbjAiLCJsIjoxfQ==
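The limit/continue pattern described here can be sketched as a small client loop. This is only an illustration under stated assumptions: the response is assumed to be a JSON object with a `data` list and a top-level `continue` token, and `fetch_json` and `iter_items` are hypothetical helpers (authentication and TLS setup are omitted). The fetcher is injectable so the loop can be exercised without a live cluster.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator


def fetch_json(url: str) -> dict:
    """Hypothetical fetcher: GET a URL and decode the JSON body (no auth/TLS handling)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_items(base_url: str, limit: int,
               fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield every item by following the continue token page by page."""
    token = None
    while True:
        query = {"limit": str(limit)}
        if token:
            # Pass the previous response's token back as the query string.
            query["continue"] = token
        page = fetch(f"{base_url}?{urllib.parse.urlencode(query)}")
        # Assumed response shape: {"data": [...], "continue": "<token>"}.
        yield from page.get("data", [])
        token = page.get("continue")
        if not token:
            break  # No token means this was the last page.
```

A UI would normally request one page at a time as the user navigates rather than draining every page, but the token handling is the same.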

Yu-Jack commented 1 month ago

@torchiaf The node API also supports it. I tested with pcidevices because I happened to be using them at that moment. Pagination is covered in the official Kubernetes documentation, and we do support it.

hoo29 commented 1 month ago

Rediscovered the Rancher issue I was referring to: https://github.com/rancher/dashboard/issues/8527. Will any of the improvements being worked on there benefit Harvester, or are they separate?

torchiaf commented 2 days ago

@hoo29 That is a separate improvement on the Rancher API side. In your case, I think the best approach would be to try UI pagination for the Volumes page; 8 hosts will be displayed on the same page in any case. This will require further investigation from our side.

hoo29 commented 2 days ago

@torchiaf I feel like there is something broken with the Volumes page. On 3 separate installs, we can see the main thread get blocked in the browser consistently every 2 seconds, freezing the UI. While some pages are slow, none freeze the UI apart from the Volumes page. Attached is a screenshot from a profile in Edge (it happens in all browsers).

Is this a known issue or unique to us?

hoo29 commented 2 days ago

This is with 98 volumes.