Handle insertion of replacement disks

sleinen commented 8 years ago

A tricky part in our current (mostly manual) process is noticing when the service supplier has replaced the disk. This can presumably be automated: when a disk is inserted, the OS discovers it and logs the event, so a periodic script could detect that event in the log and kick off the process. Alternatively, we could install such a script as a "hook" that the OS calls as soon as the new device is discovered. We expect a few replacement disks in the coming days or weeks, which will be a good opportunity to develop and test such automation.
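
A minimal sketch of the log-scanning idea, assuming the kernel announces a new disk with a line ending in "Attached SCSI disk" (as in the excerpts later in this thread). The function name `find_attached_disks` is hypothetical, and a real periodic script would also need to remember how far it has already read and cope with log rotation.

```python
import re

# Matches the last line the kernel logs for a newly attached disk, e.g.
#   "... sd 10:0:8:0: [sdg] Attached SCSI disk"
# Group 1 is the SCSI address (host:channel:target:lun), group 2 the device name.
ATTACH_RE = re.compile(r"\bsd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] Attached SCSI disk")


def find_attached_disks(lines):
    """Return a (scsi_address, device_name) tuple for every disk-attach line."""
    found = []
    for line in lines:
        match = ATTACH_RE.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```

A cron job could call this with the lines of /var/log/kern.log and start the follow-up steps for each tuple it returns.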

sleinen commented 8 years ago

This is an example of what is written to /var/log/kern.log when a disk is replaced. The first lines appear when the old (broken) disk is removed:

Jul  7 15:38:17 zhdk0053 kernel: [5031780.526635] sd 10:0:4:0: [sdg] Synchronizing SCSI cache
Jul  7 15:38:17 zhdk0053 kernel: [5031780.526758] sd 10:0:4:0: [sdg]
Jul  7 15:38:17 zhdk0053 kernel: [5031780.526763] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul  7 15:38:17 zhdk0053 kernel: [5031780.527088] mpt2sas0: removing handle(0x000d), sas_addr(0x4433221104000000)

When the new disk is inserted, we see:

Jul  7 15:42:19 zhdk0053 kernel: [5032022.700282] scsi 10:0:8:0: Direct-Access     ATA      WDC WD4000F9YZ-0 1A02 PQ: 0 ANSI: 6
Jul  7 15:42:19 zhdk0053 kernel: [5032022.700294] scsi 10:0:8:0: SATA: handle(0x000d), sas_addr(0x4433221104000000), phy(4), device_name(0x50014ee003fa557e)
Jul  7 15:42:19 zhdk0053 kernel: [5032022.700298] scsi 10:0:8:0: SATA: enclosure_logical_id(0x5001e67d940d9000), slot(4)
Jul  7 15:42:19 zhdk0053 kernel: [5032022.700388] scsi 10:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Jul  7 15:42:19 zhdk0053 kernel: [5032022.700396] scsi 10:0:8:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Jul  7 15:42:19 zhdk0053 kernel: [5032022.700802] sd 10:0:8:0: Attached scsi generic sg6 type 0
Jul  7 15:42:19 zhdk0053 kernel: [5032022.701277] sd 10:0:8:0: [sdg] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
Jul  7 15:42:19 zhdk0053 kernel: [5032022.701281] sd 10:0:8:0: [sdg] 4096-byte physical blocks
Jul  7 15:42:19 zhdk0053 kernel: [5032022.707410] sd 10:0:8:0: [sdg] Write Protect is off
Jul  7 15:42:19 zhdk0053 kernel: [5032022.707416] sd 10:0:8:0: [sdg] Mode Sense: 7f 00 10 08
Jul  7 15:42:19 zhdk0053 kernel: [5032022.708361] sd 10:0:8:0: [sdg] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul  7 15:42:19 zhdk0053 kernel: [5032022.736997]  sdg: sdg1
Jul  7 15:42:19 zhdk0053 kernel: [5032022.771422] sd 10:0:8:0: [sdg] Attached SCSI disk

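Note how the SCSI address changed: the old device was target 4 (10:0:4:0), the new one is target 8 (10:0:8:0). The handle (0x000d), the sas_addr, and the device name (sdg) stayed the same.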
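
Since the target number differs between removal and insertion while the sas_addr in the excerpts above does not, a script could pair the two events by sas_addr. The following is a rough sketch under that assumption; the function name `match_replacements` is hypothetical, and the regular expressions only cover the mpt2sas message formats shown above.

```python
import re

# "mpt2sas0: removing handle(0x000d), sas_addr(0x4433221104000000)"
REMOVE_RE = re.compile(
    r"removing handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)"
)
# "scsi 10:0:8:0: SATA: handle(0x000d), sas_addr(...), phy(4), device_name(...)"
INSERT_RE = re.compile(
    r"scsi (\d+:\d+:\d+:\d+): SATA: handle\((0x[0-9a-f]+)\), "
    r"sas_addr\((0x[0-9a-f]+)\), phy\((\d+)\), device_name\((0x[0-9a-f]+)\)"
)


def match_replacements(lines):
    """Pair each insertion with an earlier removal at the same sas_addr."""
    removed = set()        # sas_addr values that have lost their disk
    replacements = []
    for line in lines:
        match = REMOVE_RE.search(line)
        if match:
            removed.add(match.group(2))
            continue
        match = INSERT_RE.search(line)
        if match and match.group(3) in removed:
            removed.discard(match.group(3))
            replacements.append({
                "scsi_address": match.group(1),
                "sas_addr": match.group(3),
                "phy": int(match.group(4)),
                "device_name": match.group(5),
            })
    return replacements
```

Insertions without a preceding removal are ignored here, which keeps disks discovered at boot from being treated as replacements.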

sleinen commented 7 years ago

Here's an example message, quoting the insertion log messages it is based on. It illustrates the kind of output we could generate and the actions the script should support.

The disk came online as /dev/sde:

  [Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: Direct-Access     ATA      WDC WD4000F9YZ-0 1A02 PQ: 0 ANSI: 6
  [Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: SATA: handle(0x000b), sas_addr(0x4433221101000000), phy(1), device_name(0x50014ee004236411)
  [Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: SATA: enclosure_logical_id(0x5001e67a2b9b8000), slot(1)
  [Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
  [Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: Attached scsi generic sg4 type 0
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] 4096-byte physical blocks
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Write Protect is off
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Mode Sense: 7f 00 10 08
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
  [Tue Jan 24 15:36:22 2017]  sde: unknown partition table
  [Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Attached SCSI disk

And is identified as:

  Serial Number:      WD-WCC5D4CLCZJF
  Logical Unit WWN Device Identifier: 50014ee004236411
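
In this example, the device_name field in the kernel log (0x50014ee004236411) carries the same value as the reported WWN, so a script might be able to cross-check the two. A small sketch of that check follows; the helper names `parse_identity` and `device_name_from_log` are hypothetical, and whether device_name equals the WWN on every controller and drive is an assumption that would need testing.

```python
import re


def parse_identity(text):
    """Extract serial number and WWN from 'Key: value' identification lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "serial": fields.get("Serial Number"),
        "wwn": fields.get("Logical Unit WWN Device Identifier"),
    }


def device_name_from_log(lines):
    """Return the first device_name(...) value in the log, without the 0x prefix."""
    for line in lines:
        match = re.search(r"device_name\(0x([0-9a-f]+)\)", line)
        if match:
            return match.group(1)
    return None
```

If the two values disagree, the script should probably stop and ask for a human to look, rather than map the wrong disk.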

Switched off the LED.

Added the new WWN to the block-device udev mapping as dde and ran:

  sudo udevadm control --reload
  sudo udevadm trigger

Verified that the /dev/dde symlink was created (ls -l /dev/dde*).

Re-ran Puppet to create the OSD:

  sudo puppet agent -t
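
The manual steps above (reload udev, trigger, check the symlink, run Puppet) could be chained roughly as follows. This is only a sketch: the function names are hypothetical, editing the udev mapping itself is left out because its format is not shown here, and the `run` and `exists` parameters exist only so the sequence can be exercised without root. One detail worth noting is that `puppet agent -t` enables detailed exit codes, so exit status 2 means changes were applied and is treated as success.

```python
import os
import subprocess

# Commands quoted from the steps above.
UDEV_COMMANDS = [
    ["sudo", "udevadm", "control", "--reload"],
    ["sudo", "udevadm", "trigger"],
]
PUPPET_COMMAND = ["sudo", "puppet", "agent", "-t"]


def run_checked(cmd, ok_codes, run):
    """Run cmd and raise if its exit status is not one of ok_codes."""
    result = run(cmd)
    if result.returncode not in ok_codes:
        raise RuntimeError(f"{' '.join(cmd)} exited with {result.returncode}")


def bring_up_disk(alias, run=subprocess.run, exists=os.path.exists):
    """Reload udev, verify /dev/<alias> exists, then run Puppet."""
    for cmd in UDEV_COMMANDS:
        run_checked(cmd, {0}, run)
    link = f"/dev/{alias}"
    if not exists(link):
        raise RuntimeError(f"{link} was not created")
    # 0 = no changes, 2 = changes applied (detailed exit codes)
    run_checked(PUPPET_COMMAND, {0, 2}, run)
    return link
```

Because `udevadm trigger` returns before the events have been processed, a real script would likely need to wait (for example with `udevadm settle`) before checking for the symlink.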

sleinen / diskonade

Handle insertion of replacement disks #3