openvstorage / openvstorage-health-check

The health check is classified as a monitoring and detection tool for Open vStorage.

Disk safety says 1 disk is lost but it isn't #254

Closed kinvaris closed 7 years ago

kinvaris commented 7 years ago

Problem description

Disk safety says 1 disk is lost but it isn't

Logs

[FAILED] Backend mybackend-global has lost 1 disk(s). Losing more disks will cause data loss!
[WARNING] Backend mybackend-global: 21 out of 21 objects have to be repaired.
[FAILED] Amount of objects to repair is increasing for backend_name mybackend-global.

{
      "namespace": "667560ac-bf64-43cc-aa3c-bc1b7354a44a",
      "safety": 1,
      "safety_count": 21,
      "bucket_safety": [
        {
          "bucket": [ 1, 2, 2, 1 ],
          "count": 21,
          "applicable_dead_osds": 0,
          "remaining_safety": 1
        }
      ]
 }

Possible solution

The logs show that the namespace has a disk safety of 1 because its bucket is 1,2,2, while in ALBA itself it should be 1,2,3. This makes the health check report a failure with 1 disk lost. In reality no disk is lost: the check does not take applicable_dead_osds into account. That value is 0 because we only have 2 OSDs instead of 3.
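To illustrate the idea, here is a minimal sketch of a check that only reports lost disks when `applicable_dead_osds` is non-zero. It is not the code from the health check itself. The function name `classify` is hypothetical, the reading of the bucket as `[k, m, fragment count, max per node]` is an assumption, and so is the mapping to SUCCESS, WARNING and FAILED.

```python
# Bucket entry copied from the namespace record in the logs above.
ENTRY = {
    "bucket": [1, 2, 2, 1],
    "count": 21,
    "applicable_dead_osds": 0,
    "remaining_safety": 1,
}


def classify(entry):
    """Return (status, lost_disks) for one bucket_safety entry.

    Assumption: the bucket is [k, m, fragment count, max per node],
    so m is the maximum disk safety the policy can reach.
    """
    _k, m, _fragments, _max_per_node = entry["bucket"]
    remaining = entry["remaining_safety"]
    dead = entry["applicable_dead_osds"]

    if remaining >= m:
        # Full safety: nothing to report.
        return "SUCCESS", 0
    if dead == 0:
        # Safety is below the maximum, but no OSD holding this data is dead,
        # so no disk was lost (for example, too few OSDs were available).
        return "WARNING", 0
    # Safety is reduced and dead OSDs are involved: disks were really lost.
    return "FAILED", dead


status, lost = classify(ENTRY)
print(status, lost)
```

For the entry from the logs this yields a warning with zero lost disks instead of a failure, which is the behaviour the issue asks for.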

Additional information

Setup

Hyperconverged

Packages

ii  blktap-openvstorage-utils            2.0.90-2ubuntu5                     amd64        utilities to work with VHD disk images files
ii  libblktapctl0-openvstorage           2.0.90-2ubuntu5                     amd64        Xen API blktapctl shared library (shared library)
ii  libvhd0-openvstorage                 2.0.90-2ubuntu5                     amd64        VHD file format access library
ii  libvhdio-2.0.90-openvstorage         2.0.90-2ubuntu5                     amd64        Xen API blktap shared library (shared library)
ii  openvstorage                         2.7.9.2-1                           amd64        openvStorage
ii  openvstorage-backend                 1.7.9.1-1                           amd64        openvStorage Backend plugin
ii  openvstorage-backend-core            1.7.9.1-1                           amd64        openvStorage Backend plugin core
ii  openvstorage-backend-webapps         1.7.9.1-1                           amd64        openvStorage Backend plugin Web Applications
ii  openvstorage-core                    2.7.9.2-1                           amd64        openvStorage core
ii  openvstorage-hc                      1.7.9.1-1                           amd64        openvStorage Backend plugin HyperConverged
ii  openvstorage-health-check            3.1.5-rev.412.2d1379c-1             amd64        Open vStorage HealthCheck
ii  openvstorage-sdm                     1.6.9.1-1                           amd64        Open vStorage Backend ASD Manager
ii  openvstorage-webapps                 2.7.9.2-1                           amd64        openvStorage Web Applications
kinvaris commented 7 years ago

This should do the trick: https://github.com/openvstorage/openvstorage-health-check/commit/6deb3146bf4a656f3d4c390dd0457e7d33664c45

This commit does not include caching, because we have not yet taken into account that namespaces that were in the cache need to be deleted. Caching will follow in a future release.

kinvaris commented 7 years ago

Attended output:

root@ovs-node01-1604:~# ovs healthcheck alba disk-safety
[INFO] Checking disk safety on backend: mybackend02
[INFO] Checking policy `1,2` with max. disk safety `2`
[SUCCESS] All data is safe on backend `mybackend02` with `6` namespace(s)
[INFO] Checking disk safety on backend: mybackend
[INFO] Checking policy `1,2` with max. disk safety `2`
[SUCCESS] All data is safe on backend `mybackend` with `6` namespace(s)
[INFO] Checking disk safety on backend: mybackend-global
[INFO] Checking policy `1,2` with max. disk safety `2`
[WARNING] The disk safety of `5` namespace(s) is `1`, max. disk safety is `2`: 
06773eb5-c56e-4319-88ff-b2fc7d0140b6 with 100% of its objects,
4368bb53-c3a2-47fa-881d-cbb52deed282 with 100% of its objects,
cfa76aec-9687-4a69-9a12-261d55d805a1 with 100% of its objects,
e0aaab1c-2197-41ed-b4d4-a489c9ab24b0 with 100% of its objects,
e88c88c9-632c-4975-b39f-e9993e352560 with 100% of its objects
[INFO] Recap of alba disk-safety!
[INFO] ======================
[INFO] SUCCESS=2 FAILED=0 SKIPPED=0 WARNING=1 EXCEPTION=0

Unattended:

root@ovs-node01-1604:~# ovs healthcheck alba disk-safety unattended
disk-safety-mybackend SUCCESS
disk-safety-mybackend-global WARNING
disk-safety-mybackend02 SUCCESS
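The unattended output can be consumed by a monitoring script. The sketch below assumes each line has the form `<test name> <STATUS>`, which is inferred from the output above and is not a documented contract; `parse_unattended` is a hypothetical helper, not part of the health check.

```python
def parse_unattended(output):
    """Map each test name to its status, one 'name STATUS' pair per line."""
    results = {}
    for line in output.splitlines():
        parts = line.strip().rsplit(None, 1)  # split on the last whitespace run
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, status = parts
        results[name] = status
    return results


# Output lines copied from the unattended run above.
SAMPLE = """disk-safety-mybackend SUCCESS
disk-safety-mybackend-global WARNING
disk-safety-mybackend02 SUCCESS"""

results = parse_unattended(SAMPLE)
not_ok = sorted(name for name, state in results.items() if state != "SUCCESS")
print(not_ok)  # → ['disk-safety-mybackend-global']
```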

Silent:

  'disk-safety-mybackend': 'SUCCESS',
  'disk-safety-mybackend-global': 'WARNING',
  'disk-safety-mybackend02': 'SUCCESS',
kinvaris commented 7 years ago

https://github.com/openvstorage/openvstorage-health-check/pull/255