No metrics from mgr after update to ceph 18.2.1

krptg0 commented 9 months ago

Is this a bug report or feature request?

Bug Report

Deviation from expected behavior: Only part of the wanted metrics are showing up in Grafana.

Expected behavior: All metrics should show up.

How to reproduce it (minimal and precise):

Don't really know if it's tied to updating from 1.12.

File(s) to submit:

Cluster CR (custom resource), typically called cluster.yaml, if necessary

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.1-20240103
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
      - name: rook
        enabled: true
  dashboard:
    enabled: true
    ssl: false
    prometheusEndpoint: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
  monitoring:
    enabled: true
  network:
    connections:
      encryption:
        enabled: false
      compression:
        enabled: false
  crashCollector:
    disable: false
  logCollector:
    enabled: true
    periodicity: daily
    maxLogSize: 500M
  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  resources:
    api:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    mgr:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    mon:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    osd:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1750Mi"
  removeOSDsIfOutAndSafeToRemove: false
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  storage:
    useAllNodes: true
    useAllDevices: true
    onlyApplyOSDPlacement: false
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false

Logs to submit:

Operator's logs, if necessary
Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>.
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read GitHub documentation if you need help.

Cluster Status to submit:

Output of kubectl commands, if necessary

To get the health of the cluster, use kubectl rook-ceph health.
To get the status of the cluster, use kubectl rook-ceph ceph status.
For more details, see the Rook kubectl Plugin.

Environment:

OS (e.g. from /etc/os-release): Ubuntu 22.04
Kernel (e.g. uname -a):
Cloud provider or hardware configuration:
Rook version (use rook version inside of a Rook Pod): 1.13.2
Storage backend version (e.g. for ceph do ceph -v): 18.2.1
Kubernetes version (use kubectl version):
Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): OK

After updating, I still have 2 running mgrs. One of them (not always the active one) has Prometheus enabled, and I can curl localhost:9283 from within the pod. The first clue is the HTTP answer:

Not really sure why the metrics are empty.

The other one simply denies my request:

My Prometheus instance (I tried both my existing one in the "monitoring" namespace from the kube-prometheus-stack (KPS) Helm chart and the one from the Rook documentation, which uses the Prometheus operator) reports "Connection refused".
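To tell an exporter that answers with an empty body apart from one that refuses the connection, a small check along these lines could be run from inside each mgr pod (or through a port-forward). This is only a sketch: port 9283 comes from the report above, the `/metrics` path is the usual one for the mgr prometheus module, and the helper names `count_samples` and `probe` are made up for illustration.

```python
import urllib.error
import urllib.request


def count_samples(exposition_text: str) -> int:
    """Count metric samples in Prometheus text-format output.

    Comment lines (# HELP / # TYPE) and blank lines are not samples.
    """
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.startswith("#")
    )


def probe(url: str = "http://localhost:9283/metrics", timeout: float = 5.0) -> str:
    """Classify an exporter endpoint as erroring, unreachable, empty, or serving."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return f"http error {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Covers "Connection refused" and timeouts.
        return f"unreachable: {err}"
    samples = count_samples(body)
    return "empty response" if samples == 0 else f"{samples} samples"


if __name__ == "__main__":
    print(probe())
```

Running it against both mgr pods would show whether the two daemons fail in the same way or, as described here, one returns nothing while the other refuses the connection.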

Since the new dashboard also relies on a Prometheus instance to retrieve metrics for the main graph, my dashboard is currently empty and I can't follow what is going on in the cluster.

The ceph-exporter pods work and are scraped as intended by the "external" (to the rook-ceph namespace) Prometheus. I didn't change any configuration on this.

EDIT: mgr.a, which is the one not responding with Connection Refused, just gave me this:

❯ k rook-ceph ceph crash info 2024-01-22T14:28:33.933176Z_b1e61cde-5713-4085-88fd-632ba046f68e
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found",
        "\nDuring handling of the above exception, another exception occurred:\n",
        "Traceback (most recent call last):",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 649, in __init__\n    self.modify_instance_id = self.get_orch_status() and self.get_module_option(",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 869, in get_orch_status\n    return self.available()[0]",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1661, in _oremote\n    f_set = self.get_feature_set()",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found"
    ],
    "ceph_version": "18.2.1",
    "crash_id": "2024-01-22T14:28:33.933176Z_b1e61cde-5713-4085-88fd-632ba046f68e",
    "entity_name": "mgr.a",
    "mgr_module": "prometheus",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "ImportError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "6f4033c8739c625d3935380854c7945f8bd28267f6a4d03fc2017bc8815c257c",
    "timestamp": "2024-01-22T14:28:33.933176Z",
    "utsname_hostname": "rook-ceph-mgr-a-76c75c5d67-szgdp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-91-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023"
}
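For anyone comparing several crash reports like this one, the fields that matter can be pulled out with a few lines of Python. This is a rough sketch, not part of Ceph or Rook: `summarize_crash` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the report above rather than new output.

```python
import json


def summarize_crash(report_json: str) -> dict:
    """Extract the daemon, module, and final exception from a crash report."""
    report = json.loads(report_json)
    backtrace = report.get("backtrace", [])
    exc = report.get("mgr_python_exception")
    # Stack frames start with '  File "..."'; exception lines start with the
    # exception class name reported in mgr_python_exception.
    frames = [line for line in backtrace if line.lstrip().startswith("File ")]
    errors = [line for line in backtrace if exc and line.startswith(exc)]
    return {
        "entity": report.get("entity_name"),
        "module": report.get("mgr_module"),
        "exception": exc,
        "message": errors[-1] if errors else None,
        # Keep only the location line of the last frame, not the source line.
        "last_frame": frames[-1].strip().split("\n")[0] if frames else None,
    }


# Trimmed copy of the crash report above (raw string keeps the JSON escapes).
SAMPLE = r'''
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 869, in get_orch_status\n    return self.available()[0]",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found"
    ],
    "entity_name": "mgr.a",
    "mgr_module": "prometheus",
    "mgr_python_exception": "ImportError"
}
'''

if __name__ == "__main__":
    for key, value in summarize_crash(SAMPLE).items():
        print(f"{key}: {value}")
```

Applied to the full report, the summary points at the prometheus module on mgr.a failing with an ImportError raised from `mgr_module.py` while it queries the orchestrator status.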

EDIT2:

Just rolled back to 18.2.0 and the metrics are back. The issue is filed in the Ceph tracker: https://tracker.ceph.com/issues/64051

travisn commented 9 months ago

This looks the same as #13527. Please read through that issue to see if the workaround helps.

krptg0 commented 9 months ago

Yup, that's exactly the same issue. The workaround using docker.io/rkachach/ceph:v18.2.1_patched_v1 did work.

travisn commented 9 months ago

> Yup, that's exactly the same issue. The workaround using docker.io/rkachach/ceph:v18.2.1_patched_v1 did work.

Good to hear it worked. I will close this issue then.

R-Studio commented 4 months ago

@travisn & @krptg0 any news on this?

krptg0 commented 4 months ago

@R-Studio I think it has been fixed in v18.2.2. Everything works out of the box now with the latest releases (that's why the issue has been closed).

R-Studio commented 4 months ago

@krptg0 Thank you very much, I will update it next week. 😉

rook / rook

No metrics from mgr after update to ceph 18.2.1 #13605