redhat-performance / quads

:calendar: The infrastructure deployment time machine
https://quads.dev
GNU General Public License v3.0

QUADS should remove LVM signatures from non system disks #133

Closed bengland2 closed 3 years ago

bengland2 commented 7 years ago

Barry recently got an allocation for CephFS testing, and when he went to install with ceph-ansible, it hit errors because there were LVM signatures on the OSD (data) drives. It would be great if QUADS could wipe LVM signatures and partitions from the drives (other than the system disk). It seems like you could do some kind of "pvscan" during the install and just nuke any LVM PVs other than the system disk with a force option. Then to get rid of partitions, do "parted -s /dev/whatever mktable gpt". When the system reboots after the install, it will be clean. This would just speed up the process of deploying Ceph or Gluster on data drives.
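
A minimal sketch of the kind of cleanup being suggested, assuming the OS lives entirely on /dev/sda and every other sd* device is a data disk that is safe to wipe (the assumptions and the script are illustrative only, not something QUADS runs today):

#!/bin/bash
# Illustration only: assumes the OS is confined to /dev/sda and all other
# sd* devices are data disks that are safe to wipe.
OS_DISK=/dev/sda

# Remove any LVM physical volumes that are not on the OS disk.
for pv in $(pvs --noheadings -o pv_name); do
  [[ $pv == ${OS_DISK}* ]] && continue
  pvremove -ff -y "$pv"
done

# Re-label every other disk with an empty GPT partition table.
for disk in /dev/sd[b-z]; do
  [[ -b $disk ]] || continue
  parted -s "$disk" mktable gpt
done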

bengland2 commented 7 years ago

pvremove -ff was the command I was trying to remember.

bengland2 commented 7 years ago

or maybe wipefs -a would work better since it just erases signatures.

sadsfae commented 7 years ago

hey @bengland2 we're very limited by what Anaconda supports during disk provisioning, unless there's a way to identify non-OS LVM metadata and nuke that during kickstart %post when we have a bit more flexibility.

Something like a pattern-matched pvremove or wipefs would work here, clearing everything except the PVs that hold the system VG. This needs testing though.
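
As a rough %post-style sketch of that idea (assuming the root filesystem sits on LVM so the system VG can be discovered; this is untested and not the actual QUADS/Foreman template):

# Find the VG backing the root filesystem, then wipe every PV outside it.
ROOT_VG=$(lvs --noheadings -o vg_name "$(findmnt -n -o SOURCE /)" | tr -d ' ')
for pv in $(pvs --noheadings -o pv_name,vg_name | awk -v vg="$ROOT_VG" '$2 != vg {print $1}'); do
  wipefs -a "$pv"    # or: pvremove -ff -y "$pv"
done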

Anaconda already runs the following, which should clear the disks entirely prior to installation; if it doesn't, that seems like a bug.

zerombr
clearpart --all --initlabel

This sounds similar to a bug we filed with RHEL Atomic where it would leave cruft behind on the disk, except in that case we couldn't even get a proper install afterwards.

sadsfae commented 7 years ago

Update here, we believe this is a bug in Anaconda and have retrieved logs from a failed deployment that had LVM cruft on the disks.

The following manual intervention was needed before Anaconda/kickstart would work:

pvremove /dev/* --force --force -y
for disk in $(ls /dev/{sd*,nv*}); do wipefs -a -f $disk; done

I'll update things here when I have a BZ to track.

bengland2 commented 7 years ago

great job, thank you! @jharriga @bmarson, you like? ;-)

An annoying question if I may -- does your kickstart file use the "clearpart" directive to clear out the block device before the RHEL install? If so, is it a bug in Anaconda if "clearpart" doesn't clear LVM signatures? After all, "clearpart" is supposed to erase partitions. Maybe Anaconda needs a different directive like "clearpvs" or something like that? Should clearing PVs be implicit in "clearpart"?

sadsfae commented 7 years ago

Hey @bengland2 yes, we use both clearpart and zerombr, e.g.

zerombr
clearpart --all --initlabel

I think Anaconda should have no problem clearing partitions/disks; that is its normal behavior and it does this with other existing filesystems, but it chokes on the LVM signatures as we've seen. I recall other bug reports about this in the past with similar disk metadata (I'll need to go back and find these).

I've dug through the Anaconda documentation and the options around disk manipulation during kickstart are very limited; that's pretty much all I've found. I agree that clearing all PVs should be part of the kickstart prep process, so we'll see what the developers say here.

We may need to contact someone internally on the Anaconda/Base OS side if you know anyone.

bengland2 commented 7 years ago

@Deepthidharwar this was the discussion I referenced. @sadsfae, just FYI, Deepthi hit the same problem on BAGL and couldn't even reconfigure the RAID controller successfully using this playbook. I think we could fix this playbook to do what you did above, and then it would always work.

bengland2 commented 6 years ago

@sadsfae @kambiz-aghaiepour can the commands in the Sep 19 post be part of %pre in the kickstart file? In other words, can we do this stuff BEFORE the operating system is installed on the storage? Then we don't have to change Anaconda. I think pvremove and wipefs would be available at that point, though I'm not sure. And I think it would be

for disk in /dev/sd[a-z] /dev/nvme[0-9]n[0-9] ...

so that partitions are not selected.
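
For illustration, a %pre section along those lines might look like the sketch below; whether these tools behave the same in the installer environment would need verifying, and this is not something that was actually deployed:

%pre --interpreter=/bin/bash
# Select whole disks only (lsblk TYPE "disk"), never partitions, and clear
# LVM/filesystem signatures before Anaconda partitions anything.
for disk in $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" {print "/dev/"$1}'); do
  lvm pvremove -ff -y "$disk" 2>/dev/null   # only relevant if the whole disk is a PV
  wipefs -a "$disk"
done
%end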

sadsfae commented 6 years ago

@bengland2 I'm afraid that's not possible without some kind of very custom kickstart which first boots to a ramdisk / initrd to execute those commands. Tackling this in kickstart really isn't ideal; we should try to fix it in RHEL.

I've filed the following bug for this and we can follow up here:

https://bugzilla.redhat.com/show_bug.cgi?id=1506680

bengland2 commented 6 years ago

We tried this on a BAGL system that had leftover boot blocks, so we PXE booted to Linux rescue mode and followed @sadsfae's suggestion above. In addition, to wipe all MBRs on the system, we ran:

for disk in $(ls /dev/{sd*,nv*}); do dd if=/dev/zero of=$disk bs=512 count=1 oflag=direct ; done

We verified that the boot blocks were erased by rebooting the system: it could not find any boot blocks. pvscan in rescue mode verified that nothing of LVM was left on the drives.

Chris was trying to fix this problem in BAGL, so we then deployed via Beaker (as on the old BAGL page) with:

ignoredisk=--only-use=sda

bengland2 commented 6 years ago

It turns out it was a bad idea to zero out all the MBRs without replacing them; it seems like Anaconda is not installing an MBR for us now!

John Harrigan @jharriga found this article on repairing MBR, which recommends

grub2-install --root-directory=/ /dev/sda

maybe we should have tried that instead of "grub2-install /dev/sda", Chris @QuantumPosix @ekaynar

I also found this article about restoring Linux MBR that suggested

sudo grub-install --recheck /dev/sda

bengland2 commented 6 years ago

Here's my latest script for cleansing a system of all LVM and partition signatures, except for the operating system disk (this assumes there is only one). If the operating system is spread across all disks, this unfortunately will not work as is. We could run something like this from a PXE-booted rescue image. The script is limited in scope to systems with MegaRAID controllers that the storcli utility can manage. Note that it assumes the MegaRAID controller has been reset to factory defaults; it then creates virtual drives, one per HDD, and only then does it try to erase signatures.
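
The script itself isn't attached here, but a rough reconstruction of the described flow (based on the storcli64 syntax quoted elsewhere in this thread; the controller ID, enclosure ID, slot range and OS disk are placeholders) would look something like:

#!/bin/bash
STORCLI=/opt/MegaRAID/storcli/storcli64
OS_DISK=/dev/sda

# Drop any existing virtual drives; the controller is otherwise assumed to be
# at factory defaults already.
$STORCLI /c0/vall delete force

# One R0 virtual drive per HDD so each shows up as its own block device
# (enclosure 93, slots 0-23 are placeholder IDs).
for slot in $(seq 0 23); do
  $STORCLI /c0 add vd type=r0 drives=93:$slot
done

# Erase LVM and partition signatures everywhere except the OS disk.
for disk in /dev/sd[a-z]; do
  [[ $disk == "$OS_DISK" ]] && continue
  wipefs -a -f "$disk"
done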

shekharberry commented 3 years ago

Hi Team,

In one of my recent ALIAS lab allocations, the data (OSD) drives were not clean and had LVM signatures and old Ceph partitions on them. This led to a failed installation of OCS/Ceph, and it consumed a fair amount of time before it was fixed and I got a successful deployment. It would be a lot less time-consuming if we received devices in a known state, on which we could then use our automation to set up the desired configuration.

A few ideas that I can think of that could be part of a QUADS allocation:

* We can have a script pre-built in QUADS that a user can run on receiving the hardware to clean all the storage devices (something along the lines of clean-interfaces.sh)?
* We can have an option to request clean-up of devices in the allocation request page itself, and based on that input QUADS can clean up the devices.

@briordan1 Since this issue about clean devices is already open, I am not putting a new RFE request for the same.

sadsfae commented 3 years ago

Howdy @shekharberry

In one of my recent ALIAS lab allocations, the data (OSD) drives were not clean and had LVM signatures and old Ceph partitions on them. This led to a failed installation of OCS/Ceph, and it consumed a fair amount of time before it was fixed and I got a successful deployment. It would be a lot less time-consuming if we received devices in a known state, on which we could then use our automation to set up the desired configuration.

Sorry for any trouble. Systems that cannot cleanly kickstart from Foreman should never be passed along to a tenant; if this happens, then something in the QUADS configuration of ALIAS is not performing the full validation steps.

We will never release systems that have not had a clean OS deployment; this is part of the validation phase, which checks whether a system is still marked for build (meaning it never kickstarted successfully).

The bug we opened back in 2017 against Anaconda / blivet (the Python storage mechanism of Anaconda) was to make Anaconda more "aware" of these kinds of phantom metadata and cruft. This has not been fixed upstream, nor does it look like any additional enhancements will happen here.

Addressing disk cruft has to happen within the constraints of Anaconda. There is no way to ensure we can get to systems returning from another allocation with any certainty in order to perform actions outside of what Anaconda does when it reprovisions.

The end result is that we will catch systems that fail validation, including ones that have metadata cruft on the disk that Anaconda cannot (by current design/limitation) remove. The tenant will never (or should never) receive these systems until validation is passed.

When we come across systems like this, typically this is how we "fix" it manually:

pvremove /dev/* --force -y
for disk in $(ls /dev/{sd*,nv*}); do wipefs -a -f $disk; done

These systems are then tossed back into the validation workflow until they reprovision cleanly and pass other checks.

This "fix" cannot be run on a live operating system, so if you ran this on a working system you'd effectively nuke it, typically this is run from a rescue shell where Anaconda dumps you when it doesn't know how to deal with these kinds of disk metadata/cruft

Either way, you should never notice a difference, only a delay: our automated re-provisioning/cleaning/validation has failed and will incur some delay in receiving the systems, as we have to manually address those cases. Our intention was to have the Anaconda team address this (as it's our only reliable vehicle to enact additional pruning of disks), but alas we have not had any real progress on the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1506680

A few ideas that I can think of that could be part of a QUADS allocation:

* We can have a script pre-built in QUADS that a user can run on receiving the hardware to clean all the storage devices (something along the lines of clean-interfaces.sh)?

This is not feasible due to the process outlined above. There is no reliable window or approach for getting at the OS of bare-metal systems (some won't even have an OS, or it was image-based, has different credentials, is CoreOS, whatever) to perform any prep actions outside of what Anaconda does during kickstart.

This is a chicken-and-egg problem:

* You can't run userland tools/scripts/processes without an OS on the system.
* Systems return to the resource pool or get scheduled to go into clouds with various, inconsistent or inaccessible OS/setup on them from prior tenants.
* The only reliable way to put an OS on the system is via Foreman kickstart called by QUADS.
* Foreman kickstart uses a chroot and works within the limitations of Anaconda/blivet to configure storage, lay down the OS and packages, and render a usable operating system.
* Due to both the chroot and the way kickstart %post operates, you are limited in the external commands, scope, and availability of userspace tools and actions you can run against a system being deployed.
* There is no viable window to enact pre-pre-deployment steps or scripts in this case.

Remember, a system should not kickstart cleanly if there is LVM metadata, disk signatures, or foreign metadata consuming the disk that the Anaconda clearing actions cannot find and remove. This means you should not even receive systems in this state, because they need to kickstart before they pass validation anyway.

If you're receiving systems in a state like this then something is wrong/off with the system type in ALIAS, and that's what should be tracked down and investigated. Information towards this end will help us look into it (but please file an internal JIRA ticket, as this relates to our R&D systems specifically and has little to do with QUADS other than perhaps the way it's configured).

* We can have an option to request clean-up of devices in the allocation request page itself, and based on that input QUADS can clean up the devices.

@briordan1 Since this issue about clean devices is already open, I am not putting a new RFE request for the same.

To summarize, there is no viable, reliable vehicle to perform pruning or cleaning actions on systems outside Anaconda (which really should be doing this). More pressure/attention should be directed to the upstream bug we filed years ago, prompted by cruft left on disk by ostree-based stacks, to get more done here.

Again, you should not even receive systems in this state anyway. It just means that until Anaconda has extended capabilities in blivet to tackle this, it's us who will need to spend more time manually fixing the rare occurrence of this before putting those machines back into the validation process, so they can matriculate as planned through the gated stages until they are released to tenants.

There's not much more we can do here. This needs to be addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1506680, and the Anaconda/blivet storage prep behind the following directives should be enhanced to catch this:

zerombr
clearpart --all --initlabel

QuantumPosix commented 3 years ago

There was no issue with the handoff; QUADS moved it cleanly with no manual intervention. This means the rebuild was successful (and the other disks may have had metadata on them while Anaconda wiped the main one). This again stems from the bug you filed. ~ Thanks,

Chris

sadsfae commented 3 years ago

There was no issue with the handoff; QUADS moved it cleanly with no manual intervention. This means the rebuild was successful (and the other disks may have had metadata on them while Anaconda wiped the main one). This again stems from the bug you filed. ~ Thanks, Chris

The following should take care of this happening and covers all physical disks known to the system. However, as we've seen, it will not cover certain kinds of cruft left behind like ostree metadata, certain LVM types, etc. (hence this issue and our subsequent related Bugzilla):

zerombr
clearpart --all --initlabel

In this case you'd be talking about non-OS disks then, as I understand it. I was/am speaking about OS disks only (we have mechanisms to handle this happening on RAID/non-OS disks, which I'll explain below).

For systems with LSI / RAID / non-OS disks we also run storcli64 commands during kickstart %post which nuke all LSI/RAID virtual disks, wipe them, and make an individual JBOD/R0 block device on each one so they are presented to the OS as individual block devices.

If the crux here is these kinds of disks, then Foreman should be calling storcli64 against them (or an equivalent tool, likely the exact equivalent racadm commands), like we do in the Scale Lab for our SuperMicro storage-class systems (6048r, 6049p).

e.g. (copied from Foreman templates)

yum install -y http://foreman.example.com:8080/repo/storcli/storcli-1.15.05-1.noarch.rpm
/opt/MegaRAID/storcli/storcli64 /c0/vall delete
# run one more time to clear current configuration
/opt/MegaRAID/storcli/storcli64 /c0/vall delete
# create all disks and VD's and present as JBOD
for x in $(seq 0 22); do /opt/MegaRAID/storcli/storcli64 /c0 add vd type=r0 drives=93:$x; done
for x in $(seq 23 29); do /opt/MegaRAID/storcli/storcli64 /c0 add vd type=r0 drives=94:$x; done
for x in $(seq 39 44); do /opt/MegaRAID/storcli/storcli64 /c0 add vd type=r0 drives=94:$x; done

Because ALIAS primarily uses Dell r740xd systems (and I assume this is the model you had a problem on?) a racadm equivalent should be put in place in the Foreman templates there, in lieu of the storcli64 approach, if storcli64 does not work. That should keep this from happening on non-OS disks.
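
For illustration only, the racadm equivalent might look roughly like the snippet below. The controller FQDD (RAID.Integrated.1-1), disk/enclosure names, slot range and credentials are placeholders, the exact syntax varies by iDRAC/PERC firmware and needs verifying, and none of this is taken from the actual ALIAS/Foreman templates:

# Placeholders: set these for the target iDRAC.
IDRAC=changeme.example.com
CREDS="-r $IDRAC -u root -p changeme"

# Wipe the existing virtual disk configuration on the PERC controller.
racadm $CREDS storage resetconfig:RAID.Integrated.1-1

# Create one R0 virtual disk per physical disk (bay numbers are placeholders).
for slot in $(seq 0 23); do
  racadm $CREDS storage createvd:RAID.Integrated.1-1 -rl r0 \
    -pdkey:Disk.Bay.$slot:Enclosure.Internal.0-1:RAID.Integrated.1-1
done

# Queue the pending controller configuration as a job (applied at next reboot).
racadm $CREDS jobqueue create RAID.Integrated.1-1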

Additionally, a script like clean-interfaces.sh could be provided too, so long as it's generated to only be pointed at RAID/non-OS disks, if the racadm equivalent of our storcli64 kickstart %post commands doesn't fully do what we need.

Further, right now, if the issue is related to non-OS disks, then manually running racadm or storcli64 commands after someone has the systems on a freshly deployed OS should work too (but we should document a supported way to run that somewhere, or it should be run during the provisioning phase via a Foreman snippet).

Note that we also purposefully use /dev/disk/by-path to ensure we deploy the OS on the small, internal SATA disk and not on the RAID disks, but I think ALIAS already does this.
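
As a hypothetical kickstart fragment (not copied from the real Foreman template; the by-path name is a per-machine placeholder), pinning the OS install to that internal SATA disk looks something like:

# Only let Anaconda touch the internal SATA disk, referenced by persistent path.
ignoredisk --only-use=disk/by-path/pci-0000:00:11.5-ata-1.0
zerombr
clearpart --all --initlabel
bootloader --location=mbr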

bengland2 commented 3 years ago

@sadsfae that BZ you cited is 3 years old; let's admit that no one is going to fix it and move on.

@QuantumPosix , it is not a "successful handoff" unless the system is in a known state where the user's automation can reliably pick up from there and deploy their software.

The last reply sounds promising: reset the RAID controller to a known state as you said, then reset the devices to a blank state, clean off the LVM crud (all LVs, VGs and PVs not used by the OS), and then run wipefs -a on the raw block devices to erase the partition tables.

sadsfae commented 3 years ago

@sadsfae that BZ you cited is 3 years old; let's admit that no one is going to fix it and move on.

@QuantumPosix , it is not a "successful handoff" unless the system is in a known state where the user's automation can reliably pick up from there and deploy their software.

The last reply sounds promising: reset the RAID controller to a known state as you said, then reset the devices to a blank state, clean off the LVM crud (all LVs, VGs and PVs not used by the OS), and then run wipefs -a on the raw block devices to erase the partition tables.

Tackling things at the RAID controller level will bridge the gap here; it's something we can do with localized/internal Foreman templating using racadm on Dell, or the existing storcli64 templating we use for SuperMicro elsewhere.

I can work with @QuantumPosix to this end; we will have some r740xd systems soon in the Scale Lab, or we can test this on a free ALIAS system.

When this is in place we'd ask folks to help us test it, or to provide a way to introduce some disk signatures so we can test it ourselves. I believe it's enough to do what we do today with SuperMicro and storcli64 but utilize the equivalent racadm commands, and this can all be done in kickstart %post as we do today.

Since metadata/LVM cruft on OS disks is caught by validation (we then fix offenders through a rescue shell using wipefs before tossing them back into the validation workflow to pass), this is otherwise covered. When I say covered, I mean that we feel the pain for this, not the tenant. The only reasonable path to fix this properly is through the Anaconda ecosystem/tools; barring a fix there, which seems unlikely, we will have to limp along with what we have.

At this point I think we can close this GitHub issue, as it's no longer really related to general QUADS behavior, nor can any modifications to QUADS help us here.

@QuantumPosix @bengland2 let's track this going forward in an internal PERFINFRA JIRA ticket.