im-0 / hpsahba

Tool to enable/disable HBA mode on some HP Smart Array controllers
GNU General Public License v2.0

Success: P410i FW6.60 w/ Ubuntu 22.04 ILO2 Remote Media Boot #30

Open ianbmacdonald opened 1 year ago

ianbmacdonald commented 1 year ago

I used ILO2 (via VSP, or Firefox ESR 55 + Java 8) and the current Ubuntu Jammy ISO to do some data recovery after an FBWC battery and cache replacement resulted in a reboot into what looked like an empty logical volume.

With the Ubuntu 5.15 kernel, only this patched fork worked, and it doesn't need any module flags:

https://github.com/mashuptwice/hpsahba

I was able to dd the entire logical volume remotely first, then switch to HBA mode and dd the individual raw disks. That gives me everything I need to reconstruct the 4-disk RAID 1+0 array in 4 different ways, compare the data, and resolve some corruption. I can already see a clean partition table block on one of the raw disks that is incomplete in the logical volume. ADU had nothing bad to say, but HBA mode is going to help me sort this out.
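The "4 different ways" follow from RAID 1+0 being a stripe (RAID 0) across mirror pairs: picking one member from each of the two pairs gives 2 × 2 = 4 candidate volumes. Below is a minimal Python sketch of that reassembly. It assumes uncompressed images, a known strip size, and that the logical volume starts at the same `data_offset` on every disk. The function names, the pairing of disks, and the strip size are all hypothetical; none of them come from hpsahba or HP documentation.

```python
import io
import itertools
from typing import BinaryIO, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def candidate_volumes(pairs: Sequence[Tuple[T, T]]) -> Iterator[Tuple[T, ...]]:
    """Every way to pick one member from each mirror pair (2 pairs -> 4 candidates)."""
    return itertools.product(*pairs)


def reassemble_raid10(members: Sequence[BinaryIO], strip_size: int,
                      out: BinaryIO, data_offset: int = 0) -> int:
    """Rebuild a RAID 1+0 volume as RAID 0 over one chosen member per mirror pair.

    members must be in stripe order (pair 1 first). strip_size and data_offset
    are assumptions that have to match the controller's real layout.
    Returns the number of bytes written to out.
    """
    for member in members:
        member.seek(data_offset)
    written = 0
    while True:
        # One stripe row = one strip from each mirror pair, in order.
        for member in members:
            strip = member.read(strip_size)
            if not strip:
                return written  # stop as soon as a member runs out of data
            out.write(strip)
            written += len(strip)


if __name__ == "__main__":
    # Toy demo with 4-byte strips; real images would be opened with open(path, "rb").
    pair1 = (io.BytesIO(b"AAAABBBB"), io.BytesIO(b"AAAABBBB"))
    pair2 = (io.BytesIO(b"CCCCDDDD"), io.BytesIO(b"CCCCDDDD"))
    for combo in candidate_volumes([pair1, pair2]):
        out = io.BytesIO()
        reassemble_raid10(combo, strip_size=4, out=out)
        print(out.getvalue())  # b'AAAACCCCBBBBDDDD' for every combination here
```

Writing each candidate to its own file and diffing the four results is one way to localize where the mirrors disagree.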

After setting up another SSH tunnel forwarding local port 222 directly to the HP DL360 G6 (running the live Ubuntu ISO mounted remotely through ILO2), I pulled down a few packages with apt and built the DKMS module. A few dd commands later, I had images of the full volume and of the raw disks right in front of me.

Super pleased. Beers on me if you are ever in Toronto.

Already I can see complete, simple data blocks, like the protective MBR, on [2 of] the raw disks that are, for some reason, mangled in the logical volume.
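Checking for those simple blocks can be scripted. The sketch below assumes 512-byte sectors, matching the `bs=512` captures, and a GPT-partitioned volume. Under those assumptions, LBA 0 should be a protective MBR: boot signature 0x55AA at offset 510 and a partition entry of type 0xEE. LBA 1 should begin with the GPT header signature `EFI PART`. The helper names are hypothetical. With no data offset, LBA 0 of a RAID 1+0 volume lands on both members of the first mirror pair, which would be consistent with seeing it on two of the four raw disks.

```python
import gzip
from typing import List

SECTOR = 512  # assumption: 512-byte logical sectors, matching the bs=512 captures


def partition_types(mbr: bytes) -> List[int]:
    """Return the four MBR partition-type bytes, or [] if the 0x55AA signature is missing."""
    if len(mbr) < SECTOR or mbr[510:512] != b"\x55\xaa":
        return []
    # Partition table: 4 entries of 16 bytes at offset 446; the type byte is at +4.
    return [mbr[446 + 16 * i + 4] for i in range(4)]


def looks_like_gpt_disk(head: bytes) -> bool:
    """True if LBA 0 is a protective MBR (type 0xEE) and LBA 1 holds the GPT signature."""
    return 0xEE in partition_types(head[:SECTOR]) and head[SECTOR:SECTOR + 8] == b"EFI PART"


def read_head(path: str, sectors: int = 2) -> bytes:
    """Read the first sectors of a raw image, transparently handling .gz captures."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return f.read(sectors * SECTOR)
```

For example, `looks_like_gpt_disk(read_head("sdc.bin.gz"))` run over each capture shows which images carry an intact protective MBR plus GPT header.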

On my local laptop:

    ssh -p222 ubuntu@127.0.0.1 "sudo dd if=/dev/sda bs=512 | gzip -1 -" | pv -s 273G | dd of=sda.bin.gz

On the P410i host via ILO:

    hpsahba -E /dev/sda

On my local laptop:

    ssh -p222 ubuntu@127.0.0.1 "sudo dd if=/dev/sdc bs=512 | gzip -1 -" | pv -s 146G | dd of=sdc.bin.gz
    ssh -p222 ubuntu@127.0.0.1 "sudo dd if=/dev/sdd bs=512 | gzip -1 -" | pv -s 146G | dd of=sdd.bin.gz
    ssh -p222 ubuntu@127.0.0.1 "sudo dd if=/dev/sde bs=512 | gzip -1 -" | pv -s 146G | dd of=sde.bin.gz
    ssh -p222 ubuntu@127.0.0.1 "sudo dd if=/dev/sdf bs=512 | gzip -1 -" | pv -s 146G | dd of=sdf.bin.gz
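In these pipelines `pv` sits after `gzip`, so it counts compressed bytes and its percentage against `-s 273G` or `-s 146G` is only a rough guide. It is worth confirming that each capture is complete before relying on it. The sketch below stream-decompresses a `.bin.gz` image and returns its uncompressed size and SHA-256. You can compare the size with `blockdev --getsize64 /dev/sdX` on the live system, and identical hashes flag byte-identical disks. `image_stats` is a hypothetical helper name.

```python
import gzip
import hashlib
import io
from typing import BinaryIO, Tuple, Union


def image_stats(src: Union[str, BinaryIO], chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Stream-decompress a .bin.gz capture; return (uncompressed bytes, SHA-256 hex).

    A capture cut off mid-stream should make gzip raise EOFError rather than return.
    """
    digest = hashlib.sha256()
    total = 0
    with gzip.open(src, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            total += len(chunk)
    return total, digest.hexdigest()


if __name__ == "__main__":
    # In-memory demo; on real captures pass a path such as "sdc.bin.gz".
    demo = io.BytesIO(gzip.compress(b"\0" * 4096))
    print(image_stats(demo))
```

Streaming in 1 MiB chunks keeps memory flat even for the 273G volume image.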
    root@ubuntu:~/hpsahba# ./hpsahba -i /dev/sda
    VENDOR_ID='HP'
    PRODUCT_ID='P410i'
    BOARD_ID='0x3245103c'
    SOFTWARE_NAME=
    HARDWARE_NAME=
    RUNNING_FIRM_REV='6.60'
    ROM_FIRM_REV='6.60'
    REC_ROM_INACTIVE_REV='6.60'
    YET_MORE_CONTROLLER_FLAGS='0xfa71a216'
    NVRAM_FLAGS='0x00'
    HBA_MODE_SUPPORTED=1
ianbmacdonald commented 1 year ago

Some additional notes on disabling HBA mode when I finished grabbing the raw disks.

After I grabbed the raw images, I used ./hpsahba -d /dev/sdc to change back to IR mode. It seemed to work, as verified by the status output (hpsahba -i); however, the logical volume did not re-enumerate as sda as expected. After a redirect change/update, the drives appeared in dmesg as masked. I went the extra mile: I pulled down the HP MCP packages via apt, installed ssacli, and ran its show config detail command, which confirmed the same thing: disks in the box, but no logical volume active. As suspected, after a quick reboot the volume was available, seemingly in the exact state it was in before I enabled HBA mode.

It left me wondering whether there is a way to bring the volumes back up, i.e., to have the controller rescan the RIS data without rebooting and going through the initialization process.
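On the Linux side, the hpsa driver exposes a write-only `rescan` attribute under `/sys/class/scsi_host/hostN/`, which asks the driver to look for new, changed, or removed devices. The sketch below pokes every host whose `proc_name` is `hpsa` and needs root on a real system. `rescan_hpsa_hosts` is a hypothetical name. Whether this would help here is uncertain. ssacli also showed no active logical volume, which suggests the controller firmware itself had not brought the volume back, and a driver-level rescan cannot force the firmware to re-read the RIS.

```python
import os
import tempfile
from typing import List


def rescan_hpsa_hosts(sysfs_root: str = "/sys/class/scsi_host") -> List[str]:
    """Ask every hpsa-driven SCSI host to rescan by writing "1" to its rescan attribute.

    Returns the host names that were poked. Needs root against the real sysfs.
    """
    poked = []
    for host in sorted(os.listdir(sysfs_root)):
        host_dir = os.path.join(sysfs_root, host)
        try:
            with open(os.path.join(host_dir, "proc_name")) as f:
                driver = f.read().strip()
        except OSError:
            continue  # not a SCSI host directory we can inspect
        if driver != "hpsa":
            continue
        with open(os.path.join(host_dir, "rescan"), "w") as f:
            f.write("1")
        poked.append(host)
    return poked


if __name__ == "__main__":
    # Dry run against a fake sysfs tree; as root on the real box, call it with no argument.
    with tempfile.TemporaryDirectory() as root:
        for name, driver in (("host0", "hpsa"), ("host1", "ahci")):
            os.makedirs(os.path.join(root, name))
            with open(os.path.join(root, name, "proc_name"), "w") as f:
                f.write(driver + "\n")
        print(rescan_hpsa_hosts(root))  # ['host0']
```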

ianbmacdonald commented 1 year ago

The analysis of the raw disks in HBA mode allowed me to fully grasp what happened to our array. It's sort of an unbelievable thing, so I will share. We started with a 4-disk RAID5 array, including a hot spare. As it turns out, the spare, in Bay 1, had an old RAID 1+0 image on it. When we replaced the battery and upgraded the cache module, we landed on a defective cache module (stuck on boot). We removed the module, and the system POSTed fine but noted that the array required a cache module to boot (RAID5 requires this on the P410i). We put a second module in, and the system booted and "reconfigured" after detecting the battery and cache change. What happened next, I would not have guessed. The P410i read the old RAID 1+0 RIS on the spare in Bay 1 and then proceeded to background resync/check that old logical volume across the array, copying the spare's image from Bay 1 onto Bay 3 and replicating one of the RAID5 images from Bay 2 onto Bay 4, as if they were part of a RAID 1+0 set. There was no recovery option here. Surprisingly, the P410i gave no consideration to RIS data on the other bays or to modification dates and times. It just took that garbage from the spare disk and totally corrupted the array.
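The kind of comparison that exposes this (Bay 1 copied onto Bay 3, Bay 2 onto Bay 4) can be sketched as a per-strip hash comparison between every pair of raw images. The minimal sketch below assumes uncompressed images and uses illustrative names. The comparison size need not match the controller's strip size to spot copies; at 1 MiB, a 146 GB disk yields roughly 140,000 digests, so memory stays modest. All-zero regions match everywhere, so in practice you might skip entirely zero strips before drawing conclusions.

```python
import hashlib
import io
import itertools
from typing import BinaryIO, Dict, List, Tuple


def strip_digests(f: BinaryIO, strip_size: int) -> List[bytes]:
    """SHA-256 of each consecutive strip_size chunk of an image."""
    digests = []
    while True:
        strip = f.read(strip_size)
        if not strip:
            return digests
        digests.append(hashlib.sha256(strip).digest())


def pairwise_match(images: Dict[str, BinaryIO], strip_size: int) -> Dict[Tuple[str, str], float]:
    """Fraction of strip positions at which each pair of images is byte-identical."""
    digests = {name: strip_digests(f, strip_size) for name, f in images.items()}
    result = {}
    for a, b in itertools.combinations(sorted(digests), 2):
        n = min(len(digests[a]), len(digests[b]))
        same = sum(1 for i in range(n) if digests[a][i] == digests[b][i])
        result[(a, b)] = same / n if n else 0.0
    return result


if __name__ == "__main__":
    # Toy images with 4-byte strips: bay3 mirrors bay1, bay4 half-matches bay2.
    images = {
        "bay1": io.BytesIO(b"AAAABBBB"),
        "bay2": io.BytesIO(b"CCCCDDDD"),
        "bay3": io.BytesIO(b"AAAABBBB"),
        "bay4": io.BytesIO(b"CCCCXXXX"),
    }
    for pair, score in pairwise_match(images, 4).items():
        print(pair, score)
```

A score near 1.0 between two bays that should hold different data is the signature of the unwanted resync described above.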