TachunLin commented 2 years ago

Describe the bug

This issue originate and continue the tracking in #1531
After staying on the dashboard page for a period of time (e.g 20 minutes)

The Prometheus monitoring chart become empty
Click the Reload can't recover it back
The only workaround is to refresh the page, but put it to idle for another period of time, the monitoring chart become empty again

To Reproduce Steps to reproduce the behavior:

Prepare a one v1.0.1 harvester cluster
Open the Dashboard page
Create a virtual machine
Put the VM metrics page to monitor the chart display status
Put the dashboard page to monitoring the chart display status

Expected behavior

Prometheus monitoring chart on dashboard page should always keep display the change without becoming empty.

Support bundle

supportbundle_e54fcdb3-17f5-42ad-a748-3d51e3afc40a_2022-03-14T06-38-28Z.zip

Environment:

Harvester ISO version: v1.0.1
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 1 node harvester on local kvm

Additional context Add any other context about the problem here.

The Prometheus pod running as expected, the node-exporter pod did not restart

rancher@harvester-node-2:~> sudo -i kubectl get pods -n cattle-monitoring-system
NAME                                                     READY   STATUS    RESTARTS   AGE
prometheus-rancher-monitoring-prometheus-0               3/3     Running   0          4h52m
rancher-monitoring-grafana-d9c56d79b-kcjwh               3/3     Running   0          4h52m
rancher-monitoring-kube-state-metrics-5bc8bb48bd-49l2q   1/1     Running   0          4h52m
rancher-monitoring-operator-559767d69b-sq58h             1/1     Running   0          4h52m
rancher-monitoring-prometheus-adapter-8846d4757-d65xh    1/1     Running   0          4h52m
rancher-monitoring-prometheus-node-exporter-d7xsn        1/1     Running   0          4h52m
rancher-monitoring-prometheus-node-exporter-lxrrv        1/1     Running   0          4h24m
rancher-monitoring-prometheus-node-exporter-pspnn        1/1     Running   0          4h39m

Use the default monitoring setting
The VM metrics display correctly without empty

w13915984028 commented 2 years ago

Happend to meet it, debug info:

Click the Reload, the UI was stucking in Loading

the Chrome debug showed: a couple of HTTP GET were done successfully, no failure, no on-going

Comparing, when it works, the switch from "VM Metrics" to "Cluster Metrics" triggers those first bunch of HTTP GET It had more items.

and then continuously QUERY

Question: As backend nginx has no HTTP error, prometheus/grafana PODs were also in good state. When UI metrics was "Loading", what was missing? which line of the UI code was suspending ? thanks.

n313893254 commented 2 years ago

Thanks for the reproduced information, I can reproduce refresh not work by the following hack steps.

To Reproduce

Prepare a harvester
Go to the dashboard page, and observe the metircs, it will work as expected.
Scale-down cattle-monitoring-system/rancher-monitoring-grafana of deployment to 0.
Failed to load graph banner will show in metrics tab.
Scale-up cattle-monitoring-system/rancher-monitoring-grafana of deployment to 0.
Click reload button

Expected behavior The metrics graph should reload and work as expected.

w13915984028 commented 2 years ago

2043 may be related too, please take a look, thanks.

harvesterhci-io-github-bot commented 2 years ago

Pre Ready-For-Testing Checklist

[x] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at:
https://github.com/harvester/harvester/issues/2150#issuecomment-1099972410
https://github.com/harvester/harvester/issues/2150#issuecomment-1134370921 (new)
[ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:
[x] Does the PR include the explanation for the fix or the feature?

~* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:~

~ [ ] Have the backend code been merged (harvester, harvester-installer, etc) (including `backport-needed/`)? The PR is at~

~* [ ] Which areas/issues this PR might have potential impacts on? Area Issues~

~* [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at~

[x] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at
https://github.com/rancher/dashboard/pull/5659
https://github.com/harvester/dashboard/pull/312 (new)
[ ] If labeled: require/doc Has the necessary document PR submitted or merged? The documentation issue/PR is at
[ ] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue? The automation skeleton PR is at The automation test case PR is at The issue of automation test case implementation is at (bot will auto create one using the template)
[ ] If labeled: require/integration-test Has the PR includes the integration test? The integration test PR is at
[ ] If labeled: require/manual-test-plan Has the manual test plan been documented? The updated manual test plan is at
[ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at

harvesterhci-io-github-bot commented 2 years ago

Automation e2e test issue: harvester/tests#304

n313893254 commented 2 years ago

The fresh failed bug has been fixed in the latest harvester UI. Related PR: https://github.com/rancher/dashboard/pull/5659

TachunLin commented 2 years ago

Verified on master-2a4f4de5-head (04/27). The reload issue have already fixed.

When Prometheus monitoring chart empty, Click Reload can recover monitoring chart back to display

Test Information

Test Environment: 3 nodes harvester on local kvm machines
Harvester version: master-2a4f4de5-head (04/27)

Verify Steps

Prepare 3 nodes v1.0.1 harvester cluster
Open the Dashboard page
Create a virtual machine
Put the VM metrics page to monitor the chart display status
Put the dashboard page to monitoring the chart display status
When dashboard monitoring chart display empty
Click the Reload to recover

Or

Access Harvester explorer page https:///dashboard/c/local
Access Workload -> Deployments
Change namespace to cattle-monitoring-system
Scale-down cattle-monitoring-system/rancher-monitoring-grafana of deployment to 0
When dashboard monitoring chart display empty
Scale-up cattle-monitoring-system/rancher-monitoring-grafana of deployment to 1
Click the Reload to recover

TachunLin commented 2 years ago

@n313893254 , @guangbochen While we put harvester cluster idle for about 30 minutes, the Prometheus monitoring chart still become empty. Since the original symptom still exits thus I would suggest moving to implement for further investigation

A good news is we can use Reload button to recover it back.

Support Bundle

supportbundle_e31ad37b-b0e3-4404-97ac-febf66b6ac2f_2022-04-27T14-27-43Z.zip

w13915984028 commented 2 years ago

While checking the support-bundle, I reproduced this issue:

Env: single-node harvester, one guest VM (simplest cirros, webVNC in command: top)

1. The metrics One web-page in VM- VM Metrics, it became blank in about 30 minutes

One web-page in Cluster-metrics

2. nginx log (notice: log time + 2 hours to align with time shown in metrics) except http code 200, 304, there are only code 499

for  d/harvester-vm-detail-1

192.168.122.1 - - [28/Apr/2022:09:10:00 +0000] "GET /api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/api/datasources/proxy/1/api/v1/query_range?query=irate(kubevirt_vmi_storage_iops_read_total%7Bnamespace%3D%22default%22%2C%20name%3D%22vm2%22%7D%5B5m%5D)&start=1651133400&end=1651137000&step=60 HTTP/2.0" 499 0 "https://192.168.122.144/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/d/harvester-vm-detail-1/harvester-vm-info-detail?orgId=1&kiosk&from=now-1h&to=now&refresh=10s&var-namespace=default&var-vm=vm2&theme=light" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36" 234 0.008 [cattle-system-rancher-80] [] 10.52.0.100:80 0 0.008 - ba442fb2d0b57903d2da104441b389bd

for  d/rancher-cluster-nodes-1

192.168.122.1 - - [28/Apr/2022:09:33:23 +0000] "GET /api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/api/datasources/proxy/1/api/v1/query_range?query=1%20-%20sum(node_memory_MemAvailable_bytes%20OR%20windows_os_physical_memory_free_bytes)%20by%20(instance)%20%2F%20sum(node_memory_MemTotal_bytes%20OR%20windows_cs_physical_memory_bytes)%20by%20(instance)%20&start=1651134780&end=1651138380&step=60 HTTP/2.0" 499 0 "https://192.168.122.144/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/d/rancher-cluster-nodes-1/rancher-cluster-nodes?orgId=1&kiosk&from=now-1h&to=now&refresh=30s&theme=light" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36" 311 0.008 [cattle-system-rancher-80] [] 10.52.0.100:80 0 0.008 - d7a6c8cbea43fb3b35d4c807e64b4c93

3. Prometheus/Grafana: only grafana-sc-dashboard report: Reason: Expired: too old resource version: 3949 (482030)

$ kubectl logs -n cattle-monitoring-system rancher-monitoring-grafana-d9c56d79b-zqgjb grafana-sc-dashboard 

[2022-04-28 08:23:48] Starting collector

[2022-04-28 08:23:53] Working on ADDED configmap cattle-dashboards/rancher-monitoring-persistentvolumesusage
[2022-04-28 09:22:14] ApiException when calling kubernetes: (410)

Reason: Expired: too old resource version: 3949 (482030)

[2022-04-28 09:22:20] Working on ADDED configmap cattle-dashboards/rancher-monitoring-apiserver
[2022-04-28 09:22:20] Contents of apiserver.json haven't changed. Not overwriting existing file

Summary: Need to continue investigate those 2 directions:

Both cluster-metrics and vm-metrics logged http code 499 (client shut off in the middle of processing), but cluster-metrics was alive, vm-metrics was broken. (The "Reload" can solve it)
grafana-sc-dashboard reported "too old resource version: 3949 (482030)"

@n313893254 @WuJun2016

abstracted logs: (notice: log time + 2 hours to align with time shown in metrics) 2150-recheck-2804.txt

w13915984028 commented 2 years ago

Another point needs to be checked:

When "Reload" vm metrics, and select range "1h", the "network traffic" has history data, but IO does not.

Comparing: force-refresh cluster metrics, it has all history data.

Is it possible: Front VM metrics encountered error in "IO" related part, and actively closed the VM metrics? @n313893254

w13915984028 commented 2 years ago

cluster-metrics also became blank, but the "reloaded" vm-metrics keeps working

the grafana-sc-dashboard reported a new Expired: too old resource version: 3967 (515909)

 kk logs -n cm rancher-monitoring-grafana-d9c56d79b-zqgjb grafana-sc-dashboard:

[2022-04-28 10:18:49] ApiException when calling kubernetes: (410)
Reason: Expired: too old resource version: 3967 (515909)

[2022-04-28 10:18:54] Working on ADDED configmap cattle-dashboards/rancher-monitoring-k8s-resources-cluster
[2022-04-28 10:18:54] Contents of k8s-resources-cluster.json haven't changed. Not overwriting existing file

2150-recheck-2804-2.txt

There are tens of http 499 code, with both vm-metrics and cluster-metrics.

most closing to 10:18:

192.168.122.1 - - [28/Apr/2022:10:10:36 +0000] "GET /api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/api/datasources/proxy/1/api/v1/query_range?query=sum(rate(node_network_transmit_errs_total%7Bdevice!~%22lo%7Cveth.*%7Cdocker.*%7Cflannel.*%7Ccali.*%7Ccbr.*%22%7D%5B240s%5D))%20by%20(instance)%20OR%20sum(rate(windows_net_packets_outbound_errors_total%7Bnic!~%27.*isatap.*%7C.*VPN.*%7C.*Pseudo.*%7C.*tunneling.*%27%7D%5B240s%5D))%20by%20(instance)&start=1651137000&end=1651140600&step=60 HTTP/2.0" 499 0 "https://192.168.122.144/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/d/rancher-cluster-nodes-1/rancher-cluster-nodes?orgId=1&kiosk&from=now-1h&to=now&refresh=30s&theme=light" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36" 381 0.012 [cattle-system-rancher-80] [] 10.52.0.100:80 0 0.012 - 6fa41c7c46fde829170ba23bc687cad9

192.168.122.1 - - [28/Apr/2022:10:25:34 +0000] "GET /api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/api/datasources/proxy/1/api/v1/query_range?query=1%20-%20(sum(node_filesystem_free_bytes%7Bdevice!~%22rootfs%7CHarddiskVolume.%2B%22%7D%20OR%20windows_logical_disk_free_bytes%7Bvolume!~%22(HarddiskVolume.%2B%7C%5BA-Z%5D%3A.%2B)%22%7D)%20by%20(instance)%20%2F%20sum(node_filesystem_size_bytes%7Bdevice!~%22rootfs%7CHarddiskVolume.%2B%22%7D%20OR%20windows_logical_disk_size_bytes%7Bvolume!~%22(HarddiskVolume.%2B%7C%5BA-Z%5D%3A.%2B)%22%7D)%20by%20(instance))&start=1651137900&end=1651141500&step=60 HTTP/2.0" 499 0 "https://192.168.122.144/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/d/rancher-cluster-nodes-1/rancher-cluster-nodes?orgId=1&kiosk&from=now-1h&to=now&refresh=10s&theme=light" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36" 465 0.037 [cattle-system-rancher-80] [] 10.52.0.100:80 0 0.036 - 979bc2daee2259b16c73db8be9826c7f

It looks following logs need to be analyzed first.

grafana-sc-dashboard
...
[2022-04-28 10:18:49] ApiException when calling kubernetes: (410)
Reason: Expired: too old resource version: 3967 (515909)

w13915984028 commented 2 years ago

The version used in Harvester:

                "image": "rancher/mirrored-kiwigrid-k8s-sidecar:1.12.3",
                "imagePullPolicy": "IfNotPresent",
                "name": "grafana-sc-dashboard",

w13915984028 commented 2 years ago

Analysis of ApiException when calling kubernetes: (410):

grafana-sc-dashboard ( rancher/mirrored-kiwigrid-k8s-sidecar:1.12.3 ) watches configmap in namespace cattle-dashboards, when content is changed, it writes to files in /tmp/dashboards, which are further fetched by grafana.

ApiException when calling kubernetes: (410) Reason: Expired: too old resource version: 3967 (515909) means the time out of watching API Server, happens every 30~60 minutes. The sidecar re-watch after time out. Following logs showes that, it gets new data from API Server, but after comparing data, it assumes same and Not overwriting existing file.

https://github.com/kiwigrid/k8s-sidecar/blob/master/src/helpers.py#L68

[2022-04-28 10:18:54] Working on ADDED configmap cattle-dashboards/rancher-monitoring-k8s-resources-cluster
[2022-04-28 10:18:54] Contents of k8s-resources-cluster.json haven't changed. Not overwriting existing file

For now, ApiException when calling kubernetes: (410) looks not bringing harm. @n313893254

w13915984028 commented 2 years ago

A further test with screen recording.

22:40 started with 3 web-page: Harvester cluster metrics, Harvester VM metrics, Grafana VM metrics

all good

22:57 Harvester VM metrics suddenly disappeared

remaing 2:

23:03 Harvester cluster metrics suddenly disappeared

Summary: Grafana direct UI of VM metrics worked all the time. The "embedded?" metrics in Harvester cluster and Harveter VM, disappeared suddenly in different time, and did not reload back automatically. Need further investigation. cc @n313893254

w13915984028 commented 2 years ago

Another detail:

In the screening, at 23:30, Grafana "memory" reported "no data"

After only about 8 secods, it recovered.

WuJun2016 commented 2 years ago

Hi， @w13915984028 I solved part of the problem by fixing the frontend, which tries to reload the chart when it fails to load. （Reload time is about 30-60 seconds）

There is no way to consistently reproduce the causes of this problem. I found that the frontend charts failed to load because the requests sent all failed and did not return any http status codes or error messages.

a similar situation can be reproduced as follows:

kubectl -n cattle-monitoring-system edit deploy rancher-monitoring-grafana
changespec.replicas: 0
when the chart load fails, changespec.replicas: 1 , the chart will load automatically after a while.

WuJun2016 commented 2 years ago

Hi， @w13915984028 Can you help to try to find the reason why api requests may fail at intervals and without any hint?

I think an issue can be recreated to track api request failures if needed。

harvester / harvester

[BUG] The Prometheus monitoring chart become empty after staying on dashboard page for a period of time #2150

2043 may be related too, please take a look, thanks.

Pre Ready-For-Testing Checklist

Test Information

Verify Steps

Support Bundle