openvstorage / alba

Open vStorage ALBA (alternate backend) creates a replicated or flexibly network-RAID'ed object storage backend out of Seagate Kinetic drives and local disks, with support for compression and encryption.

get-disk-safety --include-errored-as-dead returns weird results #441


JeffreyDevloo commented 7 years ago

Problem description

We were executing a test sequence for get-disk-safety in which disks were removed from the backend.

After the removal of the disks we found:

root@ovs-node-1:~# alba get-disk-safety --config arakoon://config/ovs/arakoon/mybackend01-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --to-json --include-errored-as-dead
{
  "success": true,
  "result": [
    {
      "namespace": "0749d85b-4507-4093-a89f-3aebe3dfcd5f",
      "safety": 1,
      "safety_count": 1507,
      "bucket_safety": [
        {
          "bucket": [ 5, 4, 7, 3 ],
          "count": 14,
          "applicable_dead_osds": 3,
          "remaining_safety": -1
        },
        {
          "bucket": [ 5, 4, 8, 3 ],
          "count": 225,
          "applicable_dead_osds": 3,
          "remaining_safety": 0
        },
        {
          "bucket": [ 5, 4, 9, 3 ],
          "count": 1507,
          "applicable_dead_osds": 3,
          "remaining_safety": 1
        }
      ]
    },
    {
      "namespace": "4cda730a-1221-4047-84b4-c683c2b9b42f",
      "safety": 1,
      "safety_count": 31,
      "bucket_safety": [
        {
          "bucket": [ 5, 4, 9, 3 ],
          "count": 31,
          "applicable_dead_osds": 3,
          "remaining_safety": 1
        }
      ]
    },
    {
      "namespace": "fd-myvpool01-658e90b5-f392-457a-ad37-a1d142eb30d6",
      "safety": null,
      "safety_count": null,
      "bucket_safety": []
    }
  ]
}
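
Reading each bucket as a (k, m, fragment_count, max_disks_per_node) tuple — an assumption based on the preset notation 5,4,8,3 described under Setup below, not a confirmed ALBA structure — every remaining_safety in this output matches fragment_count - k - applicable_dead_osds. A minimal sketch of that arithmetic:

```python
# Sketch; the tuple layout (k, m, fragment_count, max_disks_per_node)
# is our reading of the preset notation, not ALBA's documented format.
def remaining_safety(bucket, applicable_dead_osds):
    k, _m, fragment_count, _max_disks_per_node = bucket
    # Only the fragments beyond the k needed for decoding are spare;
    # every applicable dead osd consumes one of those spares.
    return fragment_count - k - applicable_dead_osds

assert remaining_safety((5, 4, 7, 3), 3) == -1
assert remaining_safety((5, 4, 8, 3), 3) == 0
assert remaining_safety((5, 4, 9, 3), 3) == 1
assert remaining_safety((5, 4, 9, 3), 9) == -5  # the later, worse reading
```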

Even later it got worse, dropping to -5 (though a get-disk-safety run in between showed no errors):

root@ovs-node-1:~# alba get-disk-safety --config arakoon://config/ovs/arakoon/mybackend01-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --to-json --include-errored-as-dead
{
  "success": true,
  "result": [
    {
      "namespace": "0749d85b-4507-4093-a89f-3aebe3dfcd5f",
      "safety": 1,
      "safety_count": 1561,
      "bucket_safety": [
        {
          "bucket": [ 5, 4, 8, 3 ],
          "count": 185,
          "applicable_dead_osds": 3,
          "remaining_safety": 0
        },
        {
          "bucket": [ 5, 4, 9, 3 ],
          "count": 1561,
          "applicable_dead_osds": 3,
          "remaining_safety": 1
        }
      ]
    },
    {
      "namespace": "4cda730a-1221-4047-84b4-c683c2b9b42f",
      "safety": 1,
      "safety_count": 31,
      "bucket_safety": [
        {
          "bucket": [ 5, 4, 9, 3 ],
          "count": 31,
          "applicable_dead_osds": 3,
          "remaining_safety": 1
        }
      ]
    },
    {
      "namespace": "fd-myvpool01-658e90b5-f392-457a-ad37-a1d142eb30d6",
      "safety": null,
      "safety_count": null,
      "bucket_safety": []
    }
  ]
}
root@ovs-node-1:~# alba get-disk-safety --config arakoon://config/ovs/arakoon/mybackend01-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --to-json --include-errored-as-dead
{
  "success": true,
  "result": [
    {
      "namespace": "0749d85b-4507-4093-a89f-3aebe3dfcd5f",
      "safety": -5,
      "safety_count": 1746,
      "bucket_safety": [
        {
          "bucket": [ 5, 4, 9, 3 ],
          "count": 1746,
          "applicable_dead_osds": 9,
          "remaining_safety": -5
        }
      ]
    },
    {
      "namespace": "4cda730a-1221-4047-84b4-c683c2b9b42f",
      "safety": -5,
      "safety_count": 31,
      "bucket_safety": [
        {
          "bucket": [ 5, 4, 9, 3 ],
          "count": 31,
          "applicable_dead_osds": 9,
          "remaining_safety": -5
        }
      ]
    },
    {
      "namespace": "fd-myvpool01-658e90b5-f392-457a-ad37-a1d142eb30d6",
      "safety": null,
      "safety_count": null,
      "bucket_safety": []
    }
  ]
}

We suspect the culprit is that the timestamps are not updated by the maintenance agent. The time we observed during testing was:

1478684482

The most recent timestamps in list-osds, however, were noticeably older:

root@ovs-node-1:~# alba list-osds --config arakoon://config/ovs/arakoon/mybackend01-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --to-json
{
  "success": true,
  "result": [
    {
      "id": 0,
      "alba_id": "c5f952c3-986b-40f9-9308-80190c64235a",
      "kind": "AsdV1",
      "ips": [ "10.100.199.151", "172.22.199.151" ],
      "port": 8617,
      "use_tls": false,
      "use_rdma": false,
      "albamgr_cfg": null,
      "prefix": null,
      "preset": null,
      "decommissioned": false,
      "node_id": "Qtu8XMtU7JDMQnIfK2bF2zbp7ZGEAZT6",
      "long_id": "bJN8USie0PXvvIv4z2PXQYnoYIHf6QNy",
      "total": 3490709504,
      "used": 175942762,
      "seen": [
        1478683008.763784, 1478682998.766403, 1478682998.764081,
        1478682998.76354, 1478682998.762922, 1478682988.767474,
        1478682988.763846, 1478682988.763514, 1478682988.763347,
        1478682988.763236
      ],
      "read": [
        1478683006.600372, 1478683004.995302, 1478683004.079664,
        1478683001.083082, 1478683000.103699, 1478682999.794022,
        1478682930.439348, 1478682927.37164, 1478682926.118236,
        1478682924.233188
      ],
      "write": [
        1478683005.881718, 1478683004.382988, 1478683003.216147,
        1478683003.095716, 1478683003.029198, 1478683002.845802,
        1478683002.743254, 1478683002.505302, 1478683001.919139,
        1478683000.937568
      ],
      "errors": []
    },
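
Comparing that test-time clock with the newest "seen" entry above makes the staleness concrete (times print in the node's local zone, CET here):

```python
import datetime

now = 1478684482               # clock time observed during the test
last_seen = 1478683008.763784  # newest "seen" entry from list-osds above

print(datetime.datetime.fromtimestamp(now))        # 2016-11-09 10:41:22
print(datetime.datetime.fromtimestamp(last_seen))  # 2016-11-09 10:16:48
print(now - last_seen)   # ~1473 seconds: nothing seen for ~25 minutes
```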

Temporary fix

Restarting the maintenance agent seems to have fixed the problem.

Setup

- Hyperconverged setup with 3 nodes
- Each node has 6 disks for the backend
- On nodes 1 and 2 each disk is used for 3 ASDs
- On node 3 each disk is used for 4 ASDs
- The preset we are using is the default one: 5,4,8,3
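
That gives 6 × 3 = 18 ASDs on each of nodes 1 and 2, plus 6 × 4 = 24 on node 3, so 60 ASDs in total.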

Packages

JeffreyDevloo commented 7 years ago

Some more details. The timestamp at which I sighted the issue: 1480499280

  "result": [
    {
      "namespace": "dd46326c-2fef-4d51-9b05-87f8a4369101",
      "safety": -2,
      "safety_count": 999,
      "bucket_safety": [
        {
          "bucket": [ 2, 2, 4, 3 ],
          "count": 18,
          "applicable_dead_osds": 6,
          "remaining_safety": -4
        },
        {
          "bucket": [ 2, 2, 4, 2 ],
          "count": 999,
          "applicable_dead_osds": 4,
          "remaining_safety": -2
        }
      ]
    },
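
Reading the buckets the same way as above, these values are again consistent with fragment_count - k - applicable_dead_osds: 4 - 2 - 6 = -4 and 4 - 2 - 4 = -2.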

list-osds output:

- Before writing more data: http://pastebin.com/JFvr4n0T
- After writing more data: http://pastebin.com/zQhFC6rg


domsj commented 7 years ago

I think this issue and https://github.com/openvstorage/alba/issues/312 are related. Both show that the osd read/write timestamps are insufficiently updated ... maintenance should periodically put load on the asds, and that load should then be reflected in the list-osds output. Something goes wrong in that flow and results in these two tickets.
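
A minimal sketch of the flow described here, with illustrative names (`Asd` and its methods are hypothetical stand-ins, not ALBA's actual API):

```python
import time

def keep_timestamps_fresh(asds, interval_s=60):
    """Periodically put a little load on every ASD so that its
    'seen'/'read'/'write' timestamps in list-osds stay recent."""
    while True:
        for asd in asds:
            try:
                asd.ping()  # any light read or write would do
            except ConnectionError:
                pass        # failures would land in the "errors" list
        time.sleep(interval_s)
```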

toolslive commented 7 years ago
1478684482  => Wed Nov  9 10:41:22 CET 2016
1478683008  => Wed Nov  9 10:16:48 CET 2016