taam opened 2 years ago
I no longer have a Ceph cluster at hand to test this check. What does the output look like on your system? Maybe you can provide some example data directly from the JSON API, so I can test with it?
Here are some examples (output from pvesh, Proxmox 6.4):
"health" : {
"checks" : {},
"status" : "HEALTH_OK"
},
"health" : {
"checks" : {
"POOL_SCRUB_FLAGS" : {
"detail" : [
{
"message" : "Pool foo has noscrub flag"
},
{
"message" : "Pool foo has nodeep-scrub flag"
}
],
"severity" : "HEALTH_OK",
"summary" : {
"message" : "Some pool(s) have the noscrub, nodeep-scrub flag(s) set"
}
}
},
"status" : "HEALTH_OK"
},
"health" : {
"checks" : {
"OSDMAP_FLAGS" : {
"detail" : [],
"severity" : "HEALTH_WARN",
"summary" : {
"message" : "nobackfill,norebalance,norecover flag(s) set"
}
},
"POOL_SCRUB_FLAGS" : {
"detail" : [
{
"message" : "Pool foo has noscrub flag"
},
{
"message" : "Pool foo has nodeep-scrub flag"
}
],
"severity" : "HEALTH_OK",
"summary" : {
"message" : "Some pool(s) have the noscrub, nodeep-scrub flag(s) set"
}
}
},
"status" : "HEALTH_WARN"
},
"health" : {
"checks" : {
"PG_DEGRADED" : {
"detail" : [
{
"message" : "pg 1.0 is stuck undersized for 123.456789, current state active+recovering+undersized+degraded+remapped, last acting [0,2]"
},
{
"message" : "pg 1.1 is stuck undersized for 123.456789, current state active+recovery_wait+undersized+degraded+remapped, last acting [1,2]"
}
],
"severity" : "HEALTH_WARN",
"summary" : {
"message" : "Degraded data redundancy: 12345/123456789 objects degraded (0.123%), 4 pgs degraded, 5 pgs undersized"
}
}
},
"status" : "HEALTH_WARN"
},
For the Ceph health check it would be nice to see which checks are failing. Just as a starting idea, I locally hacked a few extra lines into the end of the check_ceph_health function (but my Python knowledge is rather limited). (Technically it would probably be better to put these details on separate lines of the output.)
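Roughly, the idea is something like the following sketch (function name and output format are just an illustration, not the exact lines I used): iterate over health["checks"] and report every check whose severity is not HEALTH_OK.

# Illustrative sketch only: `health` is the dict shown in the examples above,
# as returned by the Proxmox JSON API.
def summarize_failed_checks(health):
    """Return one line per health check whose severity is not HEALTH_OK."""
    lines = []
    for name, check in health.get("checks", {}).items():
        severity = check.get("severity", "HEALTH_OK")
        if severity == "HEALTH_OK":
            continue
        summary = check.get("summary", {}).get("message", "")
        lines.append("{}: {} ({})".format(severity, name, summary))
    return lines

For the third example above this would yield e.g. "HEALTH_WARN: OSDMAP_FLAGS (nobackfill,norebalance,norecover flag(s) set)".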