Open sleinen opened 8 years ago
This is an example of what is written to /var/log/kern.log
when a disk is replaced. The first lines appear when the old (broken) disk is removed:
Jul 7 15:38:17 zhdk0053 kernel: [5031780.526635] sd 10:0:4:0: [sdg] Synchronizing SCSI cache
Jul 7 15:38:17 zhdk0053 kernel: [5031780.526758] sd 10:0:4:0: [sdg]
Jul 7 15:38:17 zhdk0053 kernel: [5031780.526763] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 7 15:38:17 zhdk0053 kernel: [5031780.527088] mpt2sas0: removing handle(0x000d), sas_addr(0x4433221104000000)
When the new disk is inserted, we see:
Jul 7 15:42:19 zhdk0053 kernel: [5032022.700282] scsi 10:0:8:0: Direct-Access ATA WDC WD4000F9YZ-0 1A02 PQ: 0 ANSI: 6
Jul 7 15:42:19 zhdk0053 kernel: [5032022.700294] scsi 10:0:8:0: SATA: handle(0x000d), sas_addr(0x4433221104000000), phy(4), device_name(0x50014ee003fa557e)
Jul 7 15:42:19 zhdk0053 kernel: [5032022.700298] scsi 10:0:8:0: SATA: enclosure_logical_id(0x5001e67d940d9000), slot(4)
Jul 7 15:42:19 zhdk0053 kernel: [5032022.700388] scsi 10:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Jul 7 15:42:19 zhdk0053 kernel: [5032022.700396] scsi 10:0:8:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Jul 7 15:42:19 zhdk0053 kernel: [5032022.700802] sd 10:0:8:0: Attached scsi generic sg6 type 0
Jul 7 15:42:19 zhdk0053 kernel: [5032022.701277] sd 10:0:8:0: [sdg] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
Jul 7 15:42:19 zhdk0053 kernel: [5032022.701281] sd 10:0:8:0: [sdg] 4096-byte physical blocks
Jul 7 15:42:19 zhdk0053 kernel: [5032022.707410] sd 10:0:8:0: [sdg] Write Protect is off
Jul 7 15:42:19 zhdk0053 kernel: [5032022.707416] sd 10:0:8:0: [sdg] Mode Sense: 7f 00 10 08
Jul 7 15:42:19 zhdk0053 kernel: [5032022.708361] sd 10:0:8:0: [sdg] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 7 15:42:19 zhdk0053 kernel: [5032022.736997] sdg: sdg1
Jul 7 15:42:19 zhdk0053 kernel: [5032022.771422] sd 10:0:8:0: [sdg] Attached SCSI disk
Note how the SCSI address changed: the old device was target 4 (10:0:4:0), the new one is target 8 (10:0:8:0).
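To make the address comparison concrete, here is a minimal Python sketch that pulls the SCSI target and the sas_addr out of the log lines quoted above. The function names and regular expressions are my own and only assume the line layout seen in this excerpt; other kernel or driver versions may log differently.

```python
import re

# host:channel:target:lun as it appears after "sd " or "scsi " in these logs.
SCSI_ADDR = re.compile(r"\b(?:sd|scsi) (\d+):(\d+):(\d+):(\d+):")
SAS_ADDR = re.compile(r"sas_addr\((0x[0-9a-f]+)\)")


def scsi_target(line):
    """Return the SCSI target number found in a kernel log line, or None."""
    m = SCSI_ADDR.search(line)
    return int(m.group(3)) if m else None


def sas_addr(line):
    """Return the sas_addr value found in a kernel log line, or None."""
    m = SAS_ADDR.search(line)
    return m.group(1) if m else None


# Lines copied from the kern.log excerpt above.
removal = ("Jul 7 15:38:17 zhdk0053 kernel: [5031780.526635] "
           "sd 10:0:4:0: [sdg] Synchronizing SCSI cache")
removal_handle = ("Jul 7 15:38:17 zhdk0053 kernel: [5031780.527088] "
                  "mpt2sas0: removing handle(0x000d), "
                  "sas_addr(0x4433221104000000)")
insertion = ("Jul 7 15:42:19 zhdk0053 kernel: [5032022.700294] "
             "scsi 10:0:8:0: SATA: handle(0x000d), "
             "sas_addr(0x4433221104000000), phy(4), "
             "device_name(0x50014ee003fa557e)")

print(scsi_target(removal), scsi_target(insertion))
print(sas_addr(removal_handle) == sas_addr(insertion))
```

In this excerpt the sas_addr is the same for the removed and the inserted disk while the target differs, so the sas_addr (or the slot) looks like a better key than the SCSI address for matching a replacement to the bay it went into.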
Here's an example message that quotes the insertion log messages it is based on. It could serve as an example of the kind of output we could generate and what actions the script should support.
The disk came online as /dev/sde:
[Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: Direct-Access ATA WDC WD4000F9YZ-0 1A02 PQ: 0 ANSI: 6
[Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: SATA: handle(0x000b), sas_addr(0x4433221101000000), phy(1), device_name(0x50014ee004236411)
[Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: SATA: enclosure_logical_id(0x5001e67a2b9b8000), slot(1)
[Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: Attached scsi generic sg4 type 0
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] 4096-byte physical blocks
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Write Protect is off
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Mode Sense: 7f 00 10 08
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Tue Jan 24 15:36:22 2017] sde: unknown partition table
[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Attached SCSI disk
And is identified as:
Serial Number: WD-WCC5D4CLCZJF
Logical Unit WWN Device Identifier: 50014ee004236411
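In this example the device_name value in the SATA log line is the same hex string as the WWN reported here, so a script could probably pick up the WWN and the device node straight from the insertion messages. A rough Python sketch follows; `summarize_insertion` is a hypothetical helper, and the assumption that device_name always equals the WWN is based only on this one log.

```python
import re

# Assumption: device_name(0x...) carries the 16-hex-digit WWN, as it does
# in the log quoted above.
DEVICE_NAME = re.compile(r"device_name\(0x([0-9a-f]{16})\)")
ATTACHED = re.compile(r"\[(sd[a-z]+)\] Attached SCSI disk")


def summarize_insertion(lines):
    """Collect the WWN and device node from disk-insertion log lines."""
    info = {"wwn": None, "device": None}
    for line in lines:
        m = DEVICE_NAME.search(line)
        if m:
            info["wwn"] = m.group(1)
        m = ATTACHED.search(line)
        if m:
            info["device"] = "/dev/" + m.group(1)
    return info


# Two of the lines quoted above.
sample = [
    "[Tue Jan 24 15:36:22 2017] scsi 0:0:4:0: SATA: handle(0x000b), "
    "sas_addr(0x4433221101000000), phy(1), device_name(0x50014ee004236411)",
    "[Tue Jan 24 15:36:22 2017] sd 0:0:4:0: [sde] Attached SCSI disk",
]
print(summarize_insertion(sample))
```

The serial number does not appear in these kernel messages, so it would still have to come from another source.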
Switched off the LED.
Added the new WWN to the block-device udev mapping as dde and ran
sudo udevadm control --reload
sudo udevadm trigger
Verified that the /dev/dde symlink was created (ls -l /dev/dde*)
Re-ran Puppet to create the OSD
sudo puppet agent -t
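The steps after editing the udev mapping could be wrapped roughly as below. This is only a sketch: `activate_disk` is a made-up name, editing the mapping itself is left out because it depends on our local file layout, and the `run` and `exists` parameters exist only so the function can be exercised without touching a real host. It treats exit code 2 from Puppet as success, since `-t` turns on detailed exit codes and 2 means changes were applied.

```python
import os
import subprocess

# Commands taken from the manual procedure above.
UDEV_COMMANDS = [
    ["sudo", "udevadm", "control", "--reload"],
    ["sudo", "udevadm", "trigger"],
]
PUPPET_COMMAND = ["sudo", "puppet", "agent", "-t"]


def activate_disk(alias, run=subprocess.run, exists=os.path.exists):
    """Reload udev, check the /dev/<alias> symlink, then run Puppet.

    Assumes the new WWN has already been added to the udev mapping.
    """
    for cmd in UDEV_COMMANDS:
        run(cmd, check=True)
    link = "/dev/" + alias
    if not exists(link):
        raise RuntimeError("expected symlink %s was not created" % link)
    result = run(PUPPET_COMMAND)
    # With -t, 0 means no changes and 2 means changes were applied.
    if result.returncode not in (0, 2):
        raise RuntimeError("puppet agent exited with %d" % result.returncode)
    return link
```

The symlink check may race with udev processing right after the trigger; running `udevadm settle` before the check would be one way to avoid that.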
A tricky part of our current (mostly manual) process is noticing when the service supplier replaces the disk. Presumably this can be automated: when a disk is inserted, the OS discovers it and logs the event, so a periodic script could find that event in the log and kick off the process. We could even install such a script as a "hook" that the OS calls as soon as the new device is discovered. We expect a few replacement disks in the coming days or weeks, which will be a good opportunity to develop and test such automation.
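For the periodic variant, something like the following Python sketch might be a starting point. It remembers how far it has read in the log and reports only the "Attached SCSI disk" lines that appeared since the previous run. The function name, the state file, and the regular expression are assumptions for illustration; the pattern is derived only from the log lines quoted in this issue.

```python
import os
import re

ATTACHED = re.compile(
    r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] Attached SCSI disk")


def new_disk_events(log_path, state_path):
    """Return (scsi_address, device) pairs logged since the previous call."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = int(f.read().strip() or 0)
    # If the log was rotated and is now shorter, start from the beginning.
    if offset > os.path.getsize(log_path):
        offset = 0
    events = []
    with open(log_path, "rb") as log:
        log.seek(offset)
        for raw in log:
            m = ATTACHED.search(raw.decode("utf-8", errors="replace"))
            if m:
                events.append((m.group(1), m.group(2)))
        offset = log.tell()
    # Persist the new read position for the next run.
    with open(state_path, "w") as f:
        f.write(str(offset))
    return events
```

Run from cron against /var/log/kern.log, each returned pair could trigger the notification or the follow-up steps. A disk that is attached at boot would also match, so the caller would still need to check whether the device is actually a replacement.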