Open hoo29 opened 2 months ago
I also tried browsing via a Rancher deployment, but it suffers from the same issues.
Please follow https://docs.harvesterhci.io/v1.3/monitoring/harvester-monitoring#from-ui to set higher memory and CPU limits for the related pods.
Prometheus uses more resources in a large cluster. You can set a limit, observe, and adjust until you find suitable resource limits for your cluster.
A comment in Slack says they were getting OOM-killed with 7 nodes, which does not seem like a particularly large cluster. Is it possible to get the defaults increased to provide a better out-of-the-box experience for more users?
It also depends on how many VMs are deployed: more VMs mean more pods, so Prometheus collects, stores, and analyzes more data.
For the default settings, Harvester tends to keep relatively low values to make sure even single-node and 3-node clusters can run those features.
A one-size-fits-all default value is not easy to define.
@hoo29 any chance we could please have a support bundle from this cluster?
@w13915984028 - as a middle ground, could the pod requests be kept the same so they are schedulable on smaller clusters, but the limits increased? If clusters have enough resources deployed (VMs, volumes, etc.) to require more monitoring pod resources, it seems likely the underlying nodes will be large enough to support this. Customers can still fine-tune, but it's more likely to work out of the box.
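To make the "same requests, higher limits" proposal concrete, here is a minimal sketch. It relies on the fact that the Kubernetes scheduler places pods based on their requests, not their limits, so raising only the limits does not change where a pod can be scheduled. The function name `raise_limits_only` and every value below are hypothetical illustrations, not Harvester's actual defaults.

```python
from copy import deepcopy


def raise_limits_only(resources: dict, new_limits: dict) -> dict:
    """Return a copy of a container `resources` block with the limits
    updated and the requests left untouched."""
    updated = deepcopy(resources)
    updated.setdefault("limits", {}).update(new_limits)
    return updated


# Hypothetical values for illustration only; not Harvester's real defaults.
current = {
    "requests": {"cpu": "500m", "memory": "1Gi"},
    "limits": {"cpu": "1", "memory": "2Gi"},
}

proposed = raise_limits_only(current, {"memory": "8Gi"})
print(proposed)
```

Only `limits.memory` changes in the result; the requests that the scheduler uses stay exactly as they were.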
@ibrokethecloud - emailed to harvester-support-bundle@suse.com. Please can I ask that metadata such as URLs and names of resources are not posted publicly.
We have a cluster of 46 nodes and also saw slowdowns on both Hosts and Dashboard. Monitoring OOMed until we increased the limits on Prometheus to 8GB. I feel it would be beneficial to paginate Hosts.
We are also currently trying to use the vGPU capabilities (NVIDIA A100s) and had to increase both CPU and memory on the harvester-pcidevices-controller.
@torchiaf As said above, when a cluster has 46 nodes, the node page is slow. Could we paginate hosts?
The UI can handle pagination for Kubernetes resources. I'm not sure if the node API supports it. @bk201 Do you know more about it?
@torchiaf, @Yu-Jack helped test, and here is the finding:
This is an example request with a limit: https://192.168.1.122/k8s/clusters/c-m-87f5m4xz/v1/harvester/devices.harvesterhci.io.pcidevices?limit=1 Then use the continue field of the response as the next request’s query string, like https://192.168.1.122/k8s/clusters/c-m-87f5m4xz/v1/harvester/devices.harvesterhci.io.pcidevices?limit=1&continue=eyJyIjoiMTYzNDUwNjMiLCJjIjoiZXlKMklqb2liV1YwWVM1ck9ITXVhVzh2ZGpFaUxDSnlkaUk2TVRZek5EVXdOak1zSW5OMFlYSjBJam9pYW1GamEyNXZaR1V0TURBd01EQXdNREF3WEhVd01EQXdJbjAiLCJsIjoxfQ==
@torchiaf The node API also supports it. I tested with pcidevices because I happened to be using them at that moment. About pagination, this is the official Kubernetes documentation on it, and we do support that.
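The limit/continue flow described above can be sketched as a small client-side loop. This is only an illustration under stated assumptions: `fetch_json` is a hypothetical callable that performs the HTTP GET and returns the decoded JSON body, and the `data` key for the list of items is an assumption about the response shape. Only the `limit` and `continue` query parameters and the `continue` response field come from the findings in this thread.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def build_page_url(base_url: str, limit: int,
                   continue_token: Optional[str] = None) -> str:
    """Append limit (and continue, when present) as query parameters."""
    params = {"limit": str(limit)}
    if continue_token:
        params["continue"] = continue_token
    return f"{base_url}?{urlencode(params)}"


def iter_items(fetch_json: Callable[[str], dict], base_url: str,
               limit: int = 100) -> Iterator[dict]:
    """Yield every item, requesting pages until no continue token is returned.

    fetch_json is a hypothetical helper supplied by the caller; the "data"
    key is an assumed name for the list of items in each page.
    """
    token: Optional[str] = None
    while True:
        page = fetch_json(build_page_url(base_url, limit, token))
        yield from page.get("data", [])
        token = page.get("continue")
        if not token:
            break
```

A caller would pass its own authenticated fetch function and the collection URL, for example the pcidevices URL shown above without its query string.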
Rediscovered the Rancher issue I was referring to: https://github.com/rancher/dashboard/issues/8527 Will any of the improvements being worked on in there benefit Harvester, or are they separate?
@hoo29 That is a separate improvement on the Rancher API side. In your case, I think the best approach would be to try UI pagination for the Volumes page; 8 hosts will be displayed on the same page in any case. This will require further investigation from our side.
@torchiaf I feel like there is something broken with the Volumes page. On 3 separate installs we can see the main thread get blocked in the browser consistently every 2 seconds, freezing the UI. While some pages are slow, none freeze the UI apart from the Volumes page. Attached is a screenshot from a profile in Edge (it happens in all browsers).
Is this a known issue or unique to us?
This is with 98 volumes
Describe the bug
On clusters with 8+ physical nodes, 40+ VMs, and 80+ volumes, the UI tabs for Dashboard, Hosts, and Volumes are very slow, and the monitoring stack (specifically Prometheus) gets OOM-killed with the default pod limits.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The monitoring stack default values work for larger clusters. The UI is responsive.
Support bundle
Can attach if required.
Environment
Additional context
I think there is a ticket for paginated UI improvements, but I cannot find it now.