rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

No metrics from mgr after update to ceph 18.2.1 #13605

Closed: krptg0 closed this issue 9 months ago

krptg0 commented 9 months ago

Is this a bug report or feature request?

Deviation from expected behavior: Only part of the wanted metrics are showing up in Grafana.

Expected behavior: All metrics should show up.

How to reproduce it (minimal and precise):

Don't really know if it's tied to updating from 1.12.

File(s) to submit:

Cluster Status to submit:

Environment:


After updating, I still have 2 running mgrs. One of them (not always the active one) has Prometheus enabled, and I can curl localhost:9283 from within the pod. The first clue is the HTTP answer: (image). Not really sure why the metrics are empty.

The other one simply denies my request: (image)

My Prometheus instance tells me "Connection refused" (image). I tried both my historical one in NS "monitoring" from the KPS Helm chart and the one provided by the Rook documentation, which leverages the Prometheus operator.

Since the new dashboard also relies on a Prometheus instance to retrieve metrics for the main graph, my dashboard is currently empty and I can't follow anything going on with the cluster. (image)

The ceph-exporter pods are working as intended and are scraped correctly by the "external" Prometheus (external to the rook-ceph NS). I didn't change any configuration on this.
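The manual check described above (curl against port 9283 from inside the mgr pod) can be sketched in Python as follows. This is only an illustration: the `/metrics` path is an assumption based on the usual mgr Prometheus module endpoint, and `count_samples` and `fetch_metrics` are hypothetical helper names, not part of Rook or Ceph.

```python
import urllib.request


def count_samples(exposition_text: str) -> int:
    """Count metric sample lines in Prometheus text exposition output.

    Blank lines and lines starting with '#' (HELP/TYPE comments) are skipped,
    so a page that only contains comments counts as empty.
    """
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def fetch_metrics(url: str = "http://localhost:9283/metrics",
                  timeout: float = 5.0) -> str:
    """Fetch the raw metrics page.

    The default URL is an assumption (port 9283 comes from the report above).
    A refused connection surfaces as urllib.error.URLError, a subclass of OSError.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `count_samples(fetch_metrics())` from a place that can reach the mgr would distinguish the two symptoms: a result of 0 corresponds to the "empty metrics" case, while an `OSError` corresponds to the "Connection refused" case.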

EDIT: mgr.a, which is the one not responding with "Connection refused", just gave me this:

❯ k rook-ceph ceph crash info 2024-01-22T14:28:33.933176Z_b1e61cde-5713-4085-88fd-632ba046f68e
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found",
        "\nDuring handling of the above exception, another exception occurred:\n",
        "Traceback (most recent call last):",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 649, in __init__\n    self.modify_instance_id = self.get_orch_status() and self.get_module_option(",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 869, in get_orch_status\n    return self.available()[0]",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1661, in _oremote\n    f_set = self.get_feature_set()",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found"
    ],
    "ceph_version": "18.2.1",
    "crash_id": "2024-01-22T14:28:33.933176Z_b1e61cde-5713-4085-88fd-632ba046f68e",
    "entity_name": "mgr.a",
    "mgr_module": "prometheus",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "ImportError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "6f4033c8739c625d3935380854c7945f8bd28267f6a4d03fc2017bc8815c257c",
    "timestamp": "2024-01-22T14:28:33.933176Z",
    "utsname_hostname": "rook-ceph-mgr-a-76c75c5d67-szgdp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-91-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023"
}
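For anyone triaging a similar report, the fields that matter in the output above can be pulled out with a few lines of Python. This is a minimal sketch, assuming the JSON has the same keys as the report shown here; `summarize_crash` is a hypothetical helper name.

```python
import json


def summarize_crash(report_json: str) -> dict:
    """Reduce a `ceph crash info` JSON report to the fields useful for triage.

    Assumes the key names seen in the report above; missing keys yield None.
    """
    report = json.loads(report_json)
    backtrace = report.get("backtrace", [])
    return {
        "entity": report.get("entity_name"),
        "module": report.get("mgr_module"),
        "ceph_version": report.get("ceph_version"),
        "exception": report.get("mgr_python_exception"),
        # The last backtrace entry carries the final error message.
        "last_error": backtrace[-1].strip() if backtrace else None,
    }
```

Applied to the report above, the summary would name mgr.a, the prometheus module, Ceph 18.2.1, and the final `ImportError: Module not found` line, which is the combination to compare against other reports.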

EDIT2:

Just rolled back to 18.2.0, and metrics are back. The issue is filed in the Ceph tracker: https://tracker.ceph.com/issues/64051

travisn commented 9 months ago

This looks the same as #13527. Please read through that issue to see if the workaround helps.

krptg0 commented 9 months ago

Yup, that's exactly the same issue. The workaround using docker.io/rkachach/ceph:v18.2.1_patched_v1 did work.
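Both the rollback and the patched-image workaround amount to changing the Ceph image on the CephCluster resource. A hedged sketch of building that change as a JSON merge patch is below. The `spec.cephVersion.image` field is where the CephCluster CRD takes the image; the namespace and cluster name `rook-ceph` are assumptions taken from the usual Rook examples, and `image_patch` and `patch_command` are hypothetical helper names.

```python
import json
import shlex


def image_patch(image: str) -> str:
    """Build a JSON merge patch that sets spec.cephVersion.image on a CephCluster."""
    return json.dumps({"spec": {"cephVersion": {"image": image}}})


def patch_command(image: str,
                  namespace: str = "rook-ceph",
                  cluster: str = "rook-ceph") -> str:
    """Return (but do not run) a kubectl command that applies the patch.

    The namespace and cluster name defaults are assumptions; adjust as needed.
    """
    args = [
        "kubectl", "-n", namespace, "patch", "cephcluster", cluster,
        "--type", "merge", "-p", image_patch(image),
    ]
    # shlex.join quotes the JSON payload so the line can be pasted into a shell.
    return shlex.join(args)
```

For example, `print(patch_command("docker.io/rkachach/ceph:v18.2.1_patched_v1"))` prints a command to review before running it; the function itself changes nothing in the cluster.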

travisn commented 9 months ago

Good to hear it worked. I'll close this issue then.

R-Studio commented 4 months ago

@travisn & @krptg0 any news on this?

krptg0 commented 4 months ago

@R-Studio I think it has been fixed in v18.2.2. Everything works out of the box now with the latest releases (that's why the issue has been closed).

R-Studio commented 4 months ago

@krptg0 Thank you very much, I will update it next week. 😉