kerubistan / kerub

A lightweight IaaS prototype
Apache License 2.0

monitor physical disks #242

Open K0zka opened 5 years ago

K0zka commented 5 years ago

Check the drives periodically for predicted errors and create a problem when a drive fails.

K0zka commented 5 years ago

smartmontools
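
A minimal sketch of how the health line printed by `smartctl -H <device>` could be interpreted. It is in Python for brevity and is not kerub's actual code; the function name `parse_smart_health` is hypothetical. It assumes the usual health lines: `SMART overall-health self-assessment test result: PASSED` for ATA devices and `SMART Health Status: OK` for SCSI devices.

```python
import re

# Health lines as printed by `smartctl -H` for ATA and SCSI devices.
_ATA = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")
_SCSI = re.compile(r"SMART Health Status:\s*(\S+)")


def parse_smart_health(output: str):
    """Return True if healthy, False if failing, None if no health line was found.

    Hypothetical helper: it only inspects text that was already captured,
    it does not run smartctl itself.
    """
    match = _ATA.search(output)
    if match:
        return match.group(1) == "PASSED"
    match = _SCSI.search(output)
    if match:
        return match.group(1) == "OK"
    return None
```

Returning `None` for unrecognised output keeps "no SMART data" separate from "disk is failing", so a missing tool or an unsupported device does not get reported as a disk failure.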

K0zka commented 5 years ago

TODO

  1. OpenIndiana and Windows integration is missing (Windows: smartmontools is available for Cygwin)
  2. A problem must be created whenever any negative disk health indication is reported
  3. There should be step factories that provide solutions for the problem, e.g. migrate the storage, remove the disk from the VG, and so on (removing the disk from the VG is done; the same for gvinum is not done yet)
  4. There should be a few test stories about disk failure scenarios covering all kinds of storage solutions (fs, lvm, gvinum)
  5. The story should also be introduced in kerub-ext-tests, but it is unclear whether a SMART disk failure can be emulated with qemu

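
Items 2 and 3 above can be sketched roughly as follows. This is Python pseudomodel code, not kerub's real data model: `DiskHealth`, `DiskFailureProblem`, `detect_problems`, `CANDIDATE_STEPS` and `candidate_steps` are all hypothetical names, and the step lists only mirror the examples mentioned in the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DiskHealth:
    host: str
    device: str
    healthy: Optional[bool]  # None means the health could not be determined


@dataclass(frozen=True)
class DiskFailureProblem:
    host: str
    device: str


def detect_problems(reports: List[DiskHealth]) -> List[DiskFailureProblem]:
    """Create one problem per disk that reports a negative health indication."""
    return [
        DiskFailureProblem(r.host, r.device)
        for r in reports
        if r.healthy is False
    ]


# Hypothetical mapping from storage kind to the steps that could be offered.
CANDIDATE_STEPS = {
    "lvm": ["migrate storage", "remove disk from VG"],
    "gvinum": ["migrate storage"],
    "fs": ["migrate storage"],
}


def candidate_steps(storage_kind: str) -> List[str]:
    """Return the step names for a storage kind, or an empty list if unknown."""
    return list(CANDIDATE_STEPS.get(storage_kind, []))
```
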
K0zka commented 5 years ago

Windows: wmic diskdrive get status should theoretically do the trick
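
A minimal sketch of how that output could be read, assuming `wmic diskdrive get status` prints a `Status` header followed by one value per drive (for example `OK` or `Pred Fail`). The function names are hypothetical, and the text is assumed to be already decoded.

```python
from typing import List


def parse_wmic_status(output: str) -> List[str]:
    """Return the per-drive status values from `wmic diskdrive get status` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is expected to be the column header.
    if not lines or lines[0].lower() != "status":
        return []
    return lines[1:]


def any_disk_unhealthy(output: str) -> bool:
    """True if at least one drive reports a status other than OK."""
    return any(status != "OK" for status in parse_wmic_status(output))
```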

K0zka commented 5 years ago

Reminder

When an LVM VG contains a single PV and that PV signals a SMART failure, we cannot remove the PV (one must always remain). There is no operation for removing the VG completely either. As a result, the disk failure keeps being detected as a problem.

The idea is that if it is the only PV in the VG and there are no virtual storage allocations on it, the problem detector should consider it fine. I will do this if I still believe it is a good idea tomorrow.
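
The rule as proposed here could look roughly like this. It is a Python sketch with a hypothetical `VolumeGroup` model and predicate name, not kerub's actual problem detector.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VolumeGroup:
    name: str
    physical_volumes: Tuple[str, ...]  # device names of the PVs in the VG
    allocation_count: int              # virtual storage allocations on the VG


def is_failure_ignorable(vg: VolumeGroup, failing_pv: str) -> bool:
    """True only if the failing PV is the sole PV of the VG and the VG is unused."""
    only_pv = vg.physical_volumes == (failing_pv,)
    return only_pv and vg.allocation_count == 0
```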

K0zka commented 5 years ago

But in any case, a storage failure should create an alert #51

K0zka commented 5 years ago

Suppressing the problem detection if there is only one failing PV in the VG does not seem to be such a good idea. But what else could one do:

K0zka commented 5 years ago

For testing, there is some documentation on fault injection here: https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt

K0zka commented 5 years ago

Reminder

Right now the FS does not track any information about the backing block device. Therefore, when a storage device signals a future problem, kerub could evacuate the filesystem, but it does not even know that it should.
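
One possible way to collect the missing information on Linux hosts is to read `/proc/mounts`, where each line starts with the mount source followed by the mount point. A minimal Python sketch, with a hypothetical function name, that keeps only sources under `/dev/`:

```python
from typing import Dict


def backing_devices(mounts_text: str) -> Dict[str, str]:
    """Map mount point -> backing block device from /proc/mounts style text.

    Only sources that look like device nodes (/dev/...) are kept, so
    pseudo filesystems such as proc or tmpfs are skipped.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        source, mount_point = fields[0], fields[1]
        if source.startswith("/dev/"):
            result[mount_point] = source
    return result
```

Note that `/proc/mounts` escapes spaces in paths as `\040`, which this sketch does not decode, and a device-mapper source such as `/dev/mapper/...` would still need to be resolved to the physical disks behind it.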