JeffreyDevloo opened this issue 7 years ago (status: Open)
Some more details — timestamp when I first observed the issue: 1480499280
```json
"result": [
    {
        "namespace": "dd46326c-2fef-4d51-9b05-87f8a4369101",
        "safety": -2,
        "safety_count": 999,
        "bucket_safety": [
            {
                "bucket": [ 2, 2, 4, 3 ],
                "count": 18,
                "applicable_dead_osds": 6,
                "remaining_safety": -4
            },
            {
                "bucket": [ 2, 2, 4, 2 ],
                "count": 999,
                "applicable_dead_osds": 4,
                "remaining_safety": -2
            }
        ]
    },
```
list-osds output before writing more data: http://pastebin.com/JFvr4n0T — after writing more data: http://pastebin.com/zQhFC6rg
I think this issue and https://github.com/openvstorage/alba/issues/312 are related. Both show that the OSD read/write statistics are insufficiently updated: maintenance should periodically put load on the ASDs, and that load should then be reflected in the list-osds output. Something goes wrong in that flow, resulting in these two tickets.
1478684482 => Wed Nov 9 10:41:22 CET 2016
1478683008 => Wed Nov 9 10:16:48 CET 2016
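The epoch conversions above can be double-checked with a few lines of Python (CET is UTC+1; no DST applies in late November):

```python
from datetime import datetime, timezone, timedelta

CET = timezone(timedelta(hours=1))  # CET = UTC+1, valid for these November dates

for ts in (1478684482, 1478683008, 1480499280):
    print(ts, "=>", datetime.fromtimestamp(ts, CET).strftime("%a %b %d %H:%M:%S CET %Y"))
# e.g. 1478684482 => Wed Nov 09 10:41:22 CET 2016
```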
Problem description
We were executing the following sequence to test get-disk-safety:
After the removal of the disks we found:
and even later it got worse, down to -5 (although a get-disk-safety run in between showed no errors):
We suspect the culprit is that the timestamp is not updated by the maintenance agent. The time we got during testing was:
The timestamps in list-osds, however, were:
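If the maintenance agent stops refreshing the per-OSD read/write timestamps, the staleness we saw could be detected by comparing those timestamps against the current time. A minimal sketch — the field names and the threshold are my assumptions, not the real list-osds schema:

```python
import time

STALE_AFTER = 15 * 60  # seconds; arbitrary threshold for illustration

def stale_osds(osds, now=None):
    """Return the ids of OSDs whose last read/write timestamp is too old."""
    now = time.time() if now is None else now
    return [o["id"] for o in osds
            if now - max(o["read"], o["write"]) > STALE_AFTER]

# Hypothetical data shaped after the timestamps quoted in this issue:
osds = [
    {"id": 1, "read": 1478683008.0, "write": 1478683008.0},  # not refreshed
    {"id": 2, "read": 1478684482.0, "write": 1478684482.0},  # recently touched
]
print(stale_osds(osds, now=1478684542.0))  # -> [1]
```

A stale entry like OSD 1 would then distort any safety computation that trusts those timestamps, which matches the behaviour described here.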
Temporary fix
Restarting the maintenance agent seems to have fixed the problem.
Setup
Hyperconverged setup with 3 nodes:
- Each node has 6 disks for the backend
- On nodes 1 and 2, each disk is used for 3 ASDs
- On node 3, each disk is used for 4 ASDs
- The preset we are using is the default one: 5,4,8,3
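For reference, the setup above implies the following total ASD count (simple arithmetic from the description, nothing more):

```python
# Nodes 1 and 2: 6 disks x 3 ASDs each; node 3: 6 disks x 4 ASDs.
asds_nodes_1_and_2 = 2 * 6 * 3
asds_node_3 = 1 * 6 * 4
print(asds_nodes_1_and_2 + asds_node_3)  # 60 ASDs in total
```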
Packages