plus3it / spel

STIG-Partitioned Enterprise Linux (spel)

Reboot Issues After Resizing #695

Closed (ferricoxide closed this issue 2 months ago)

ferricoxide commented 3 months ago

Creating a new ticket since the OP of #691 seems to no longer be having issues (no comment since June 14th), and I don't want to continue spamming that thread if that's the case.

Originally posted by @mrabe142 in https://github.com/plus3it/spel/issues/691#issuecomment-2168365155:

Just wanted to update on what I have observed so far after a bit of testing. To clarify, I have no issues launching the AMI. I am only running into issues after I have done the initial SSH to the instance and reboot.

I spun up seven different instances:

  1. spel-minimal-rhel-8-hvm-2024.5.1 with t2.2xlarge, 500 GB gp3, legacy-bios
  2. spel-minimal-rhel-8-hvm-2024.2.1 with t2.2xlarge, 500 GB gp3, legacy-bios
  3. spel-minimal-rhel-8-hvm-2024.5.1 with t3.micro and default 20 GB gp3, uefi
  4. spel-minimal-rhel-8-hvm-2024.5.1 with t2.2xlarge, 50 GB gp3, legacy-bios
  5. spel-minimal-rhel-8-hvm-2024.5.1 with t3.xlarge, 50 GB gp3, uefi
  6. spel-minimal-rhel-8-hvm-2024.5.1 with t3.2xlarge, 500 GB gp3, uefi
  7. spel-minimal-rhel-8-hvm-2024.1.1 with t3.2xlarge, 100 GB gp3, legacy-bios

  1. Started to have boot-volume mounting issues after the second reboot; no modifications/updates.
  2. Was rebooting fine (did it around six times). After a 'dnf update' and reboot, it now kernel panics because it cannot mount the root filesystem.
  3. Rebooting fine; did multiple reboots, ran 'dnf update', ran more reboots, and it seems stable so far.
  4. Had boot-volume mounting issues after the first reboot; no modifications/updates.
  5. Rebooting normally multiple times; did 'dnf update', rebooted multiple times, and it is still working.
  6. Had boot-volume mounting issues after rebooting; no modifications/updates.
  7. Rebooting normally multiple times; did 'dnf update', rebooted multiple times, and it is still working.

In summary, 3, 5, and 7 do not exhibit reboot mounting issues (as of yet). 2 and 7 used a different version of the AMI. I have not tried to resize any volumes yet.

On comparing, I do see some differences in the devices. 1 and 4 have /dev/xvda# devices and 3 has /dev/nvme0n1p# devices. The device names change as different instance types and disk sizes are chosen.

I will continue to test after doing some resizing and hardening to see if that changes anything.

And, per @mrabe142 in https://github.com/plus3it/spel/issues/691#issuecomment-2187415382:

Just wanted to provide an update. I am not currently blocked on this, so there is no urgent need on my end; you can close it unless you want to investigate further.

I did another round of testing. I tried 5 different VMs, each with this configuration:

  • t3.2xlarge
  • 100 GB gp3 EBS storage

I used the following 5 AMIs with the above configuration:

  1. spel-minimal-rhel-8-hvm-2024.01.1.x86_64-gp3
  2. spel-minimal-rhel-8-hvm-2024.02.1.x86_64-gp3
  3. spel-minimal-rhel-8-hvm-2024.03.2.x86_64-gp3
  4. spel-minimal-rhel-8-hvm-2024.05.1.x86_64-gp3
  5. spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3

This is what I found for each configuration:

  1. I was able to use this AMI without any issues. Since it is from January, it uses Legacy BIOS. I was able to do a full 'dnf update' to bring all packages up to date. I was able to extend the LVM sizes and fully STIG the VM.
  2. This AMI did not work; it kernel panics after a 'dnf update' and reboot.
  3. I was able to use this AMI. It is in the UEFI configuration. I was able to do a full 'dnf update' to bring all packages up to date. I was able to extend the LVM sizes and fully STIG the VM.
  4. This AMI has the booting issues, especially after a 'dnf update'. Sometimes the issue does not manifest until after trying to extend an LVM mount.
  5. This AMI has the booting issues.

Since I am able to use configurations 1 and 3, I am able to continue with what I need to do. All VMs are using the same CPU/Mem/EBS configurations and all were updated to the latest packages so it is not clear to me what is causing the issues with the newer AMIs.

And, per @mrabe142 in https://github.com/plus3it/spel/issues/691#issuecomment-2189641239:

I think the t3 instance types are Nitro-capable, but I don't think I selected anything about Nitro when I launched them; I just launch on-demand instances from the Launch Instances menu of the EC2 -> Instances page of the AWS GovCloud Console. The only things I set are the name, AMI, instance type (t3.2xlarge), my public key, the VPC and subnet that were provided for me, the security group that was provided for me, and the storage size (100 GB gp3, non-encrypted). The rest of the settings are left as defaults.

I am not running any automation at this point to do the initial setup; when the instance comes up, I SSH to it. The first thing I try is a dnf update and reboot. I reboot a couple of times to see if the boot issue comes up. If it keeps coming back up after a couple of reboots without errors, I try to extend the LVM mounts. Depending on the AMI, I run one of these:

For the spel-minimal-rhel-8-hvm-2024.01.1.x86_64-gp3 AMI (only has 2 partitions):

sudo growpart --free-percent=50 /dev/nvme0n1 2
sudo lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
sudo xfs_growfs /dev/mapper/RootVG-rootVol
sudo lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
sudo xfs_growfs /dev/mapper/RootVG-varVol
sudo lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
sudo xfs_growfs /dev/mapper/RootVG-auditVol

For the other AMIs that have 4 partitions:

sudo growpart --free-percent=50 /dev/nvme0n1 4
sudo lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
sudo xfs_growfs /dev/mapper/RootVG-rootVol
sudo lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
sudo xfs_growfs /dev/mapper/RootVG-varVol
sudo lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
sudo xfs_growfs /dev/mapper/RootVG-auditVol

I try rebooting again after applying those. If they reboot a couple times without boot errors, they are usually stable at that point for all the rest of the configuration I apply to them.

ferricoxide commented 3 months ago

@mrabe142

Ok, to make this testable in a scalable way, I've converted what you've described into a userData script. That script looks like:

#!/bin/bash
#
# Bail on errors
set -euo pipefail
#
# Be verbose
set -x
#
################################################################################

# Log everything below into syslog
exec 1> >( logger -s -t "$(  basename "${0}" )" ) 2>&1

# Patch-up the system
dnf update -y

# Allocate the additional storage
if  [[ $( mountpoint /boot/efi ) =~ "is a mountpoint" ]]
then
  if  [[ -d /sys/firmware/efi ]]
  then
    echo "Partitioning for EFI-enabled instance-type..."
  else
    echo "Partitioning for EFI-ready AMI..."
  fi

  growpart --free-percent=50 /dev/nvme0n1 4
  lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
  xfs_growfs /dev/mapper/RootVG-rootVol
  lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
  xfs_growfs /dev/mapper/RootVG-varVol
  lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
  xfs_growfs /dev/mapper/RootVG-auditVol
else
  echo "Partitioning for BIOS-boot instance-type..."
  growpart --free-percent=50 /dev/nvme0n1 2
  lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
  xfs_growfs /dev/mapper/RootVG-rootVol
  lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
  xfs_growfs /dev/mapper/RootVG-varVol
  lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
  xfs_growfs /dev/mapper/RootVG-auditVol
fi

# Reboot
systemctl reboot

The above should function for either BIOS-boot or EFI-boot EC2s launched from the various spel AMIs and replicate what you described as your process. Some notes:

To launch a batch (of 30) instances, I use a BASH one-liner like:

mapfile -t INSTANCES < <(
  aws ec2 run-instances \
    --image-id ami-021ba76fc66135488 \
    --instance-type t2.xlarge \
    --subnet-id <SUBNET_ID> \
    --security-group-ids <SECURITY_GROUP_ID> \
    --iam-instance-profile 'Name=<IAM_ROLE_NAME>' \
    --key-name <PROVISIONING_KEY_NAME> \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={
        DeleteOnTermination=true,
        VolumeType=gp3,
        VolumeSize=100,
        Encrypted=false
      }' \
    --user-data file:///tmp/userData.spel_695 \
    --count 30 --query 'Instances[].InstanceId' \
    --output text | \
  tr '\t' '\n'
)

This saves all of the newly-launched instances' IDs to a BASH array (named INSTANCES) that I can then loop over to do things like reboot each instance an arbitrary number of times and check that each has returned from said reboots.

ferricoxide commented 3 months ago

As a quick "ferinstance": to reboot all of the instances in a batch (captured into the INSTANCES array), one can do something like:

for INSTANCE in "${INSTANCES[@]}"
do
  echo "Rebooting $INSTANCE..."
  INSTANCE_IP="$(
    aws ec2 describe-instances \
      --instance-id "${INSTANCE}" \
      --query 'Reservations[].Instances[].PrivateIpAddress' \
      --output text 
  )"
  timeout 10 ssh "maintuser@${INSTANCE_IP}" "sudo systemctl reboot"
done
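
As a rough sketch (illustrative, not my exact tooling), the "check that each has returned" half can be handled by polling SSH against the same INSTANCES array:

for INSTANCE in "${INSTANCES[@]}"
do
  INSTANCE_IP="$(
    aws ec2 describe-instances \
      --instance-id "${INSTANCE}" \
      --query 'Reservations[].Instances[].PrivateIpAddress' \
      --output text
  )"
  # Poll a trivial SSH command for up to five minutes; an instance stuck in
  # emergency mode never answers and gets reported as a failure
  if timeout 300 bash -c \
       "until timeout 10 ssh -o ConnectTimeout=5 maintuser@${INSTANCE_IP} true ; do sleep 15 ; done"
  then
    echo "${INSTANCE} came back from reboot"
  else
    echo "${INSTANCE} did NOT come back from reboot"
  fi
done
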
ferricoxide commented 3 months ago

Note: after spinning up 30 t3 instances from the "This AMI has the booting issues" spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3 AMI (ami-021ba76fc66135488), I used the above userData to provision them followed by using the iterative-reboots script to reboot each of those instances about a dozen times. They all came back from each of their dozen-ish reboots. I can retry with the spel-minimal-rhel-8-hvm-2024.02.1.x86_64-gp3 (ami-092037daccf8526f7) and spel-minimal-rhel-8-hvm-2024.05.1.x86_64-gp3 (ami-0455bbb8b742553ba) AMIs, but I'm not optimistic that they're any more likely to fail than in the prior testing. Since I'm not seeing failures in the above-described testing, I have to assume there's something unique to your situation that you have as yet to adequately convey.

Ultimately, I would invite you to replicate what I've done and notify me if you have failures and/or provide a more-complete description of how to reproduce the issues you're seeing.

If you wish to discuss further but don't want to include potentially-sensitive information in this public discussion, you can email me at spel-mrabe142@xanthia.com (obviously, this is a "throwaway" address used for establishing initial, private communications between you and me).

ferricoxide commented 3 months ago

One last question, @mrabe142: if you're currently finding success with deploying using the 03.2 AMI, are you able to patch it? Asking because, sometime after the introduction of the EFI-ready AMIs, we received reports that the /boot partition was too small. While that issue was addressed on April 11th, I can't remember which AMI incorporated that fix. Which is to say, if using an older AMI is your solution-path, you might run into "not enough space in /boot" issues when you go to do a dnf update.
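
If you do go the older-AMI route, a quick pre-flight check before patching is worth it (nothing spel-specific here, just stock tooling):

# How much headroom is left on the /boot partition?
df -h /boot

# How many kernels are currently installed (each one consumes /boot space)?
rpm -q kernel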

mrabe142 commented 3 months ago

Yes, I do a dnf update on all test instances. I did not encounter a problem with 03.2, but I imagine the space issue is what is causing 02.1 to kernel panic.

I did some preliminary testing using spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3

I used the same userdata block as you above, same storage size.

I tested with two different instance types:

  1. t3.xlarge
  2. t3.2xlarge

I ran two instances of each type. For type 1, both instances seemed fine (rebooted multiple times). For type 2, both instances had rebooting problems (one after instance launch, one after doing an SSH to the instance and rebooting).

The only difference between the two was the instance type. It might be worth spinning up with the second instance type to see if you see the same thing. I can run more tests when I have more time.

ferricoxide commented 3 months ago

Ok. I'll try with the t3.2xlarge.

That said, the build-platform used for creating the Amazon Machine Images is t3.2xlarge. Further, I've successfully used the resulting images with m6i, m7i and m7i-flex instance-families with 2xlarge, 4xlarge and 8xlarge sizes. Though, in fairness, I almost never resize the boot-EBS. Best practices for storage are to put OS data and application-data on separate devices. As a result, any storage-needs beyond what's needed for the base OS (e.g., to increase the size of /var/log), I almost always satisfy with secondary EBSes managed either as bare filesystem devices or as secondary LVM VGs & volumes (and, when this is for hosting /home, I move /home to a secondary EBS that overlays the trivial amount set aside for /home in the default AMI – part of why that partition's size is paltry in our build-defaults).

At any rate, "I guess we'll see": it gives me one more thing to try out.

ferricoxide commented 3 months ago

Alright. Interesting. Launched 50 of the t3.2xlarge instances using the spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3 AMI with the previously-noted userData payload and had 11 reboot-failures. While any non-zero failure rate isn't acceptable, a 22% rate is significantly more than merely not acceptable.

I'll see what, if anything, I can do to get information from them.

ferricoxide commented 3 months ago

Curiouser and curiouser…

I hadn't realized from prior communications that your reboot failures were leaving you at emergency-mode, so I hadn't included setting a root-password in my userData payload. So, I remedied that and launched a new set of 50. However, wanting to save money on the troubleshooting, I'd changed that batch to a t3.small instance-type and got a 0% failure-rate. So, going to try a couple of other instance-types to see if the problem is specific to t3.2xlarge or if "some batches are luckier than others".
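
For reference, the root-password tweak amounts to one extra line near the top of the userData payload (the value below is just a placeholder, not anything from the real payload):

# Set a throwaway root password so the emergency-mode console is usable
# (placeholder value -- replace before use)
echo 'root:REPLACE_ME' | chpasswd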

ferricoxide commented 3 months ago

Oof… Switched back to t3.2xlarge and got 19/50 failures.

W T F

ferricoxide commented 3 months ago

Just as a hedge against there being some kind of race-condition causing issues when executing:

  growpart --free-percent=50 /dev/nvme0n1 4
  lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
  xfs_growfs /dev/mapper/RootVG-rootVol
  lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
  xfs_growfs /dev/mapper/RootVG-varVol
  lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
  xfs_growfs /dev/mapper/RootVG-auditVol

I changed it to the more-compact (and internally-enforced) workflow:

  growpart --free-percent=50 /dev/nvme0n1 4
  lvresize -r --size=+4G /dev/mapper/RootVG-rootVol
  lvresize -r --size=+8G /dev/mapper/RootVG-varVol
  lvresize -r --size=+6G /dev/mapper/RootVG-auditVol

Sadly, that made no difference in the presence of reboot failures: the batch that used the changed content failed at 12/50.

ferricoxide commented 3 months ago

Ok. This may be an issue with the systemd version in RHEL 8.x. When I looked up the error message coming from systemd's trying to mount /var/log/audit:

# systemctl --no-pager status -l var-log-audit.mount
● var-log-audit.mount - /var/log/audit
   Loaded: loaded (/etc/fstab; generated)
   Active: failed (Result: protocol) since Mon 2024-07-01 13:27:30 UTC; 32min ago
    Where: /var/log/audit
     What: /dev/mapper/RootVG-auditVol
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)

Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: Mounting /var/log/audit...
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: var-log-audit.mount: Mount process finished, but there is no mount.
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: var-log-audit.mount: Failed with result 'protocol'.
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: Failed to mount /var/log/audit.

I was able to turn up an issue filed in late 2018 against the systemd project. That issue mentioned:

systemd version the issue has been seen with: 239, 238, 237

And, when I check the failed EC2s, I see:

# systemctl --version
systemd 239 (239-82.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy

Why I didn't see this on t3s smaller than 2xlarge this morning, I don't know. I'd try larger t3 instance-types, but there aren't any. I'm going to try a couple of m-class batches in the 2xlarge sizes to see if I get it there or not.

ferricoxide commented 3 months ago

Just as a sanity-check, I tried two more tests to determine whether the problem only occurs when modifying the "OS volumes" (i.e., /, /var, /var/log, /var/log/audit) or whether any modification to the boot volume-group triggers it:

As noted previously, I haven't encountered issues when adding a secondary EBS to host non-OS data. Similarly, our processes around creating AMIs only test whether the boot-EBS can be grown; they do not make any LVM modifications or do any reboots.

At any rate, the manner in which we generally use Linux EC2s and test the AMIs we publish likely accounts for why the underlying problem hasn't previously manifested.

Going to open an issue with Red Hat.

In the interim, I would suggest not trying to grow or create further volumes within the root LVM2 volume-group. Which is to say, if you've got application-data that's been driving you to alter the boot volumes' sizes, place that data on LVM objects outside of the root volume-group.
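
As a rough sketch of that approach (device name, VG/LV names, and mountpoint below are only examples), attach a secondary EBS and build a separate volume-group on it rather than growing anything in RootVG:

# Run as root; assumes the secondary EBS attaches as /dev/nvme1n1 (the
# actual device-name will vary by instance-type)
pvcreate /dev/nvme1n1
vgcreate AppVG /dev/nvme1n1
lvcreate --name appVol --extents 100%FREE AppVG
mkfs.xfs /dev/mapper/AppVG-appVol
mkdir -p /opt/appdata
echo '/dev/mapper/AppVG-appVol /opt/appdata xfs defaults 0 0' >> /etc/fstab
mount /opt/appdata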

ferricoxide commented 3 months ago

Issue opened with Red Hat. Engineers are reviewing sos report outputs for causes. However, initial response has been "this looks like it should be rebooting".
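
(For anyone who wants to gather equivalent data from their own failed instances, the stock RHEL tooling for this is roughly as follows.)

# Install and run the sos collector; newer sos versions also accept "sos report"
dnf install -y sos
sosreport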

ferricoxide commented 3 months ago

Red Hat has moved the case from their Storage team – the ones who deal with LVM, etc. – to their Services team – the ones that oversee systemd-related issues. It's believed the issue is a race-condition in systemd's mount handlers.

In the near term, if you're expanding volumes to host application-data, switch to hosting that data on volumes separate from the Root volume-group and mount as appropriate. Otherwise, until Red Hat can identify a real fix, they recommend adding the nofail option to the /var/log/audit entry in the /etc/fstab file.
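
Concretely, that interim workaround is a one-line /etc/fstab tweak. A sketch (the sed pattern assumes the entry's option-string starts with "defaults"; adjust to match your actual fstab):

# Append nofail to the /var/log/audit mount so a lost mount-race doesn't
# drop the instance into emergency mode at boot
sed -i '/RootVG-auditVol/s/defaults/defaults,nofail/' /etc/fstab
grep 'audit' /etc/fstab   # verify the change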

ferricoxide commented 3 months ago

Got a response back from the Service Team, last night.

Based on the findings so far, this looks to be related to a race condition in systemd 239. The upstream bug:

systemd: mount units fail with "Mount process finished, but there is no mount." · Issue #10872 · systemd/systemd · GitHub https://github.com/systemd/systemd/issues/10872

And the commit that fixes it:

mount: mark an existing "mounting" unit from /proc/self/mountinfo as … · systemd/systemd@1d086a6 · GitHub https://github.com/systemd/systemd/commit/1d086a6e59729635396204fc05234f1d3caa0847

We have a Jira open for the behavior when seen with NFS mounts, but based on the upstream bug and commit, I'm not confident that the behavior is isolated to NFS remote mounts, but more so the race in general:

[RHEL-5907] NFSv3 remote mounts fail with "Mount process finished, but there is no mount." when a daemon-reload happens while mounting the remote mount - Red Hat Issue Tracker https://issues.redhat.com/browse/RHEL-5907

In the circumstance on this case, it seems to present the same behaviors, and systemd is reloaded in parallel with the mounting of the filesystem

Following that RHEL-5907 issue-link, it looks like this has been going on since at least September. Not sure why any of the AMIs have worked for you. I don't have the time to verify, but I'm going to assume that the issue is present in all of our RHEL 8 AMIs:

---------------------------------------------------------------------------
|                             DescribeImages                              |
+------------------------+------------------------------------------------+
|         ImageId        |                     Name                       |
+------------------------+------------------------------------------------+
|  ami-0b86f42e4059bde4b |  spel-minimal-rhel-8-hvm-2023.12.1.x86_64-gp3  |
|  ami-0363cf6882daf4895 |  spel-minimal-rhel-8-hvm-2024.01.1.x86_64-gp3  |
|  ami-092037daccf8526f7 |  spel-minimal-rhel-8-hvm-2024.02.1.x86_64-gp3  |
|  ami-0373eef9e2b3b4bcd |  spel-minimal-rhel-8-hvm-2024.03.2.x86_64-gp3  |
|  ami-0e770985d7a4b2822 |  spel-minimal-rhel-8-hvm-2024.04.1.x86_64-gp3  |
|  ami-0455bbb8b742553ba |  spel-minimal-rhel-8-hvm-2024.05.1.x86_64-gp3  |
|  ami-021ba76fc66135488 |  spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3  |
+------------------------+------------------------------------------------+

(we have AMIs older than the above; the deprecation-tags just mean they won't show up in a search)

ferricoxide commented 2 months ago

This turned out to be a vendor (Red Hat) issue. Closing this case as there's (currently) nothing to be done via this project.

ferricoxide commented 1 month ago

Update:

Vendor-assigned engineer finally updated their Jira associated with this problem. That engineer has decided it's a WONTFIX because Red Hat 8 is too late in its lifecycle to be worth fixing what he characterized as a "nice to have" (poor word-choice: "rare" or "corner case" would probably have been a less-loaded choice). From the vendor's Jira (RHEL-5907):

Honestly, I don't think we should try to attempt to fix this issue at this time. RHEL-8 is at the point in its lifetime when nice-to-have fixes create unnecessary risk, specially fixes in core parts for systemd. I might reconsider if there is some business justification.