rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

[BUG] Project monitor status got stuck at "WaitingForDashboardValues". #47155

Open pankajsqe opened 3 days ago

pankajsqe commented 3 days ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug
While installing Prometheus Federator 105.0.0-rc.1+up0.4.2 on Rancher 2.10-head via Docker, I encountered a problem when setting up the project monitor. The setup process is stuck with the status "WaitingForDashboardValues".

Additionally, the corresponding pod, helm-install-cattle-project-p-7mj9v-monitoring-jwr97, is throwing the following error:
    Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: Namespace "cattle-dashboards" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "cattle-project-p-7mj9v-monitoring": current value is "rancher-monitoring"; annotation validation error: key "meta.helm.sh/release-namespace" must equal "cattle-project-p-7mj9v-monitoring": current value is "cattle-monitoring-system"
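
For context, this failure is Helm refusing to adopt the existing cattle-dashboards Namespace, because its meta.helm.sh ownership annotations already point at the rancher-monitoring release in cattle-monitoring-system. A rough Python sketch of that adoption rule (a simplification for illustration, not Helm's actual Go implementation):

```python
# Simplified sketch of Helm's ownership check -- NOT Helm's real code.
# A resource may only join a release if its meta.helm.sh annotations
# match the release being installed.

def check_ownership(annotations, release_name, release_namespace):
    """Return the list of annotation validation errors, if any."""
    errors = []
    owner = annotations.get("meta.helm.sh/release-name")
    if owner != release_name:
        errors.append(
            f'key "meta.helm.sh/release-name" must equal '
            f'"{release_name}": current value is "{owner}"'
        )
    owner_ns = annotations.get("meta.helm.sh/release-namespace")
    if owner_ns != release_namespace:
        errors.append(
            f'key "meta.helm.sh/release-namespace" must equal '
            f'"{release_namespace}": current value is "{owner_ns}"'
        )
    return errors

# The Namespace is already owned by rancher-monitoring, so installing the
# project release trips both checks:
errors = check_ownership(
    {
        "meta.helm.sh/release-name": "rancher-monitoring",
        "meta.helm.sh/release-namespace": "cattle-monitoring-system",
    },
    "cattle-project-p-7mj9v-monitoring",
    "cattle-project-p-7mj9v-monitoring",
)
```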

Details

To Reproduce Steps:

Result: The project monitor setup process got stuck with the status "WaitingForDashboardValues".

Expected Result: The project monitor should be created successfully.

Screenshots

(Three screenshots attached: 2024-09-16 at 7:43:48 PM, 7:43:43 PM, and 7:42:52 PM.)
mallardduck commented 9 hours ago

So this issue goes back further than my changes do. I've been able to replicate the issue on the following setup:

- Rancher: v2.10-a1a0d2edf04548b0a9099f86bcf8c194771db8f8-head
- k8s: k3s v1.27.7+k3s2
- Storage: longhorn:104.1.0+up1.6.21
- Monitoring: rancher-monitoring:105.1.0-rc.4+up61.3.2
- Federator: prometheus-federator:104.0.1+up0.4.2 (prom-fed embedded helm-controller disabled)
- Workload: a random pihole chart (doesn't matter, but I used a real one for extra realism)

And I see:

(Screenshot attached: 2024-09-20 at 10:15:15 AM.)

Update: After chatting in Slack threads I tested with the prom-fed embedded helm-controller enabled, and it worked almost instantly. Of note, before I enabled the prom-fed helm-controller I didn't have pods for installing things at all. Further chats revealed that there may not be a need to disable the embedded one on k3s, but you do on RKE2 versions. I will deploy a downstream RKE2 cluster to test with instead of the k3s one I had on hand.


Update 2: Testing on RKE2 with the helm-controller enabled yields similar results to k3s. If I disable it as the docs might imply, then it fails to create pods to install things. Sounds like Julia and Pankaj left this value on for all their tests, so I'll do the same from now on. The final test will be using my fork.
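
For reference, keeping the embedded helm-controller on is a chart values toggle. The exact values path below is an assumption and should be checked against the values.yaml of the prometheus-federator chart version in use:

```yaml
# Assumed values path for the prometheus-federator chart; verify the key
# names against the chart's values.yaml before applying.
helmProjectOperator:
  helmController:
    enabled: true   # leave the embedded controller on, per the testing above
```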

mallardduck commented 5 hours ago

This seems to be a potential incompatibility between rancher-monitoring and the rancher-project-monitoring chart that gets deployed. My fork/branch deploys the 0.4.2 chart version, and the last working version of the chart (in my testing) is the 0.3.x line.

I've found that starting in 0.4.0 the chart began including the namespace as part of the chart's templates. Adding .grafana.defaultDashboards.useExistingNamespace and setting it to true resolves the currently reported bug. However, I'm still seeing additional errors that I suspect may be related to 0.4.x versus 0.3.x chart differences in general.


Additionally, to fix image pull issues in the 0.4.2 chart it's necessary to set .global.imageRegistry so that all images pull correctly. So a complete workaround would be to add:

    global:
      imageRegistry: docker.io
    grafana:
      defaultDashboards:
        useExistingNamespace: true

into the ProjectMonitor when you create it, making sure to merge these grafana values with any existing grafana key.
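
For example, a merged result might look like the fragment below; the resources key is purely hypothetical, standing in for whatever grafana values your ProjectMonitor already sets:

```yaml
global:
  imageRegistry: docker.io
grafana:
  # hypothetical pre-existing key, shown only to illustrate the merge
  resources:
    limits:
      memory: 200Mi
  defaultDashboards:
    useExistingNamespace: true
```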

mallardduck commented 3 hours ago

To enable easier debugging I've produced images with the following tags:

Each version suffix correlates to the version of the rancher-project-monitoring chart that is embedded into the prometheus-federator binary/image.

In order to test images individually without resetting everything, it's possible to edit the prometheus-federator App.
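
As a sketch, swapping the image via the App's Helm values might look like the fragment below. Both the values path and the repository/tag are assumptions, since the actual debug tags aren't listed here:

```yaml
# Assumed values keys and placeholder image coordinates; substitute the
# real debug image tag for the chart version under test.
image:
  repository: docker.io/example/prometheus-federator
  tag: debug-0.4.2
```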