taam opened 2 years ago
I no longer have a Ceph cluster at hand to test this check. What does the output look like on your system? Maybe you can provide some example data directly from the JSON API, so I can test with it?
Here are some examples (output from pvesh, Proxmox 6.4):
"health" : {
"checks" : {},
"status" : "HEALTH_OK"
},
"health" : {
"checks" : {
"POOL_SCRUB_FLAGS" : {
"detail" : [
{
"message" : "Pool foo has noscrub flag"
},
{
"message" : "Pool foo has nodeep-scrub flag"
}
],
"severity" : "HEALTH_OK",
"summary" : {
"message" : "Some pool(s) have the noscrub, nodeep-scrub flag(s) set"
}
}
},
"status" : "HEALTH_OK"
},
"health" : {
"checks" : {
"OSDMAP_FLAGS" : {
"detail" : [],
"severity" : "HEALTH_WARN",
"summary" : {
"message" : "nobackfill,norebalance,norecover flag(s) set"
}
},
"POOL_SCRUB_FLAGS" : {
"detail" : [
{
"message" : "Pool foo has noscrub flag"
},
{
"message" : "Pool foo has nodeep-scrub flag"
}
],
"severity" : "HEALTH_OK",
"summary" : {
"message" : "Some pool(s) have the noscrub, nodeep-scrub flag(s) set"
}
}
},
"status" : "HEALTH_WARN"
},
"health" : {
"checks" : {
"PG_DEGRADED" : {
"detail" : [
{
"message" : "pg 1.0 is stuck undersized for 123.456789, current state active+recovering+undersized+degraded+remapped, last acting [0,2]"
},
{
"message" : "pg 1.1 is stuck undersized for 123.456789, current state active+recovery_wait+undersized+degraded+remapped, last acting [1,2]"
}
],
"severity" : "HEALTH_WARN",
"summary" : {
"message" : "Degraded data redundancy: 12345/123456789 objects degraded (0.123%), 4 pgs degraded, 5 pgs undersized"
}
}
},
"status" : "HEALTH_WARN"
},
For the Ceph health check it would be nice to see which checks are failing. Just as a starting idea, I locally hacked a few extra lines into the end of the check_ceph_health function (but my Python knowledge is rather limited). (Technically it would probably be better to put these details on separate lines of the output.)
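Roughly, the idea is something like the following sketch (function name and output format are just an illustration, not the exact lines I used): iterate over health["checks"] and report every check whose severity is not HEALTH_OK.

# Illustrative sketch only: `health` is the dict shown in the examples above,
# as returned by the Proxmox JSON API.
def summarize_failed_checks(health):
    """Return one line per health check whose severity is not HEALTH_OK."""
    lines = []
    for name, check in health.get("checks", {}).items():
        severity = check.get("severity", "HEALTH_OK")
        if severity == "HEALTH_OK":
            continue
        summary = check.get("summary", {}).get("message", "")
        lines.append("{}: {} ({})".format(severity, name, summary))
    return lines

For the third example above this would yield e.g. "HEALTH_WARN: OSDMAP_FLAGS (nobackfill,norebalance,norecover flag(s) set)".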