@mrabe142
Ok, to make this testable in a scalable way, I've converted what you've described into a userData script. That script looks like:
```bash
#!/bin/bash
#
# Bail on errors
set -euo pipefail
#
# Be verbose
set -x
#
################################################################################

# Log everything below into syslog
exec 1> >( logger -s -t "$( basename "${0}" )" ) 2>&1

# Patch-up the system
dnf update -y

# Allocate the additional storage
if [[ $( mountpoint /boot/efi ) =~ "is a mountpoint" ]]
then
   if [[ -d /sys/firmware/efi ]]
   then
      echo "Partitioning for EFI-enabled instance-type..."
   else
      echo "Partitioning for EFI-ready AMI..."
   fi
   growpart --free-percent=50 /dev/nvme0n1 4
   lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
   xfs_growfs /dev/mapper/RootVG-rootVol
   lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
   xfs_growfs /dev/mapper/RootVG-varVol
   lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
   xfs_growfs /dev/mapper/RootVG-auditVol
else
   echo "Partitioning for BIOS-boot instance-type..."
   growpart --free-percent=50 /dev/nvme0n1 2
   lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
   xfs_growfs /dev/mapper/RootVG-rootVol
   lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
   xfs_growfs /dev/mapper/RootVG-varVol
   lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
   xfs_growfs /dev/mapper/RootVG-auditVol
fi

# Reboot
systemctl reboot
```
The above should function for either BIOS-boot or EFI-boot EC2s launched from the various spel AMIs and replicate what you described as your process. Some notes:
The line:

```bash
set -euo pipefail
```

Causes the script to immediately abort if there are any errors in its execution. If this script aborts itself, the system will not reboot. Conversely, if you log in to the system and check the boot-history (something like `[[ $( journalctl --list-boots | wc -l ) -gt 1 ]] && echo "UserData succeeded"`), you can confirm that the script ran to completion and triggered its reboot.
The line:

```bash
set -x
```

Makes the script "chatty" in its execution.
The line:

```bash
exec 1> >( logger -s -t "$( basename "${0}" )" ) 2>&1
```

Takes the userData script's execution-output (see previous bullet) and sends it to the syslog service. You'll then be able to see all of the logged userData script's activity in `/var/log/messages` by doing `grep -P ' part-001\[\d*]: ' /var/log/messages`.
The line:

```bash
if [[ $( mountpoint /boot/efi ) =~ "is a mountpoint" ]]
```

Checks to see if the path, `/boot/efi`, is a mountpoint (as it is on EC2s launched from EFI-ready AMIs). This selector determines whether to execute (your) `growpart --free-percent=50 /dev/nvme0n1 4` or `growpart --free-percent=50 /dev/nvme0n1 2`. Note that issuing the `growpart` command is obviated if your userData payload includes a `#cloud-config` section that looks like:

```yaml
growpart:
  mode: auto
  devices: [
    '/dev/xvda2',
    '/dev/xvda4',
    '/dev/nvme0n1p2',
    '/dev/nvme0n1p4',
  ]
```
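If you do take the `#cloud-config` route while keeping the rest of the shell script, the userData payload has to be packaged as a MIME multi-part archive so that cloud-init sees both parts. A minimal sketch (the boundary string is arbitrary, and the script body is elided to the one shown earlier); cloud-init's `cloud-init devel make-mime` subcommand can generate the same structure:

```text
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"

growpart:
  mode: auto
  devices: [ '/dev/xvda2', '/dev/xvda4', '/dev/nvme0n1p2', '/dev/nvme0n1p4' ]

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# ...the provisioning script shown above, minus its growpart calls...

--==BOUNDARY==--
```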
To launch a batch (of 30) instances, I use a BASH one-liner like:

```bash
mapfile -t INSTANCES < <(
  aws ec2 run-instances \
    --image-id ami-021ba76fc66135488 \
    --instance-type t2.xlarge \
    --subnet-id <SUBNET_ID> \
    --security-group-ids <SECURITY_GROUP_ID> \
    --iam-instance-profile 'Name=<IAM_ROLE_NAME>' \
    --key-name <PROVISIONING_KEY_NAME> \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={
        DeleteOnTermination=true,
        VolumeType=gp3,
        VolumeSize=100,
        Encrypted=false
      }' \
    --user-data file:///tmp/userData.spel_695 \
    --count 30 \
    --query 'Instances[].InstanceId' \
    --output text | \
  tr '\t' '\n'
)
```
This saves all of the newly-launched instances' IDs to a BASH array (named `INSTANCES`) that I can then loop over to do things like reboot each instance an arbitrary number of times and check that each has returned from said reboots.
As a quick "ferinstance": to reboot all of the instances in a batch (captured into the `INSTANCES` array), one can do something like:

```bash
for INSTANCE in "${INSTANCES[@]}"
do
   echo "Rebooting $INSTANCE..."
   INSTANCE_IP="$(
      aws ec2 describe-instances \
        --instance-id "${INSTANCE}" \
        --query 'Reservations[].Instances[].PrivateIpAddress' \
        --output text
   )"
   timeout 10 ssh "maintuser@${INSTANCE_IP}" "sudo systemctl reboot"
done
```
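To then verify that every instance in the batch actually came back, a minimal sketch (assuming the same `INSTANCES` array and `maintuser` SSH access as above) might look like:

```bash
for INSTANCE in "${INSTANCES[@]}"
do
   INSTANCE_IP="$(
      aws ec2 describe-instances \
        --instance-id "${INSTANCE}" \
        --query 'Reservations[].Instances[].PrivateIpAddress' \
        --output text
   )"

   # Poll for up to ~5 minutes for the host to answer SSH again
   for TRY in {1..30}
   do
      if timeout 10 ssh "maintuser@${INSTANCE_IP}" true 2>/dev/null
      then
         # journalctl's boot-list grows by one entry per completed boot
         echo "${INSTANCE} is back ($(
            ssh "maintuser@${INSTANCE_IP}" 'journalctl --list-boots | wc -l'
         ) recorded boots)"
         break
      fi
      sleep 10
   done
done
```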
Note: after spinning up 30 t3 instances from the "This AMI has the booting issues" `spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3` AMI (`ami-021ba76fc66135488`), I used the above userData to provision them, followed by using the iterative-reboots script to reboot each of those instances about a dozen times. They all came back from each of their dozen-ish reboots. I can retry with the `spel-minimal-rhel-8-hvm-2024.02.1.x86_64-gp3` (`ami-092037daccf8526f7`) and `spel-minimal-rhel-8-hvm-2024.05.1.x86_64-gp3` (`ami-0455bbb8b742553ba`) AMIs, but I'm not optimistic that they're any more likely to fail than in the prior testing. Since I'm not seeing failures in the above-described testing, I have to assume there's something unique to your situation that you have yet to adequately convey.
Ultimately, I would invite you to replicate what I've done and notify me if you have failures and/or provide a more-complete description of how to reproduce the issues you're seeing.
If you wish to further discuss but don't want to include potentially-sensitive information in this public discussion, you can email me at spel-mrabe142@xanthia.com (obviously, this is a "throwaway" address used for establishing initial, private communications between you and me).
One last question, @mrabe142: if you're currently finding success with deploying using the 03.2 AMI, are you able to patch it? I'm asking because, sometime after the introduction of the EFI-ready AMIs, we received reports that the `/boot` partition was too small. While that issue was addressed on April 11th, I can't remember which AMI incorporated that fix. Which is to say, if basing your processes on an older AMI is your solution-path, you might run into "not enough space in `/boot`" issues when you go to do a `dnf update`.
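A quick way to gauge whether a given instance has that problem before patching (plain `df`, nothing spel-specific):

```bash
df -h /boot
```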
---

Yes, I do a `dnf update` on all test instances. I did not encounter a problem with 03.2, but I imagine the space issue is what is causing 02.1 to kernel panic.
I did some preliminary testing using `spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3`. I used the same userData block as you did above, and the same storage size. I tested with two different instance types, running two instances of each type. For type 1, both instances seemed fine (rebooted multiple times). For type 2, both instances had rebooting problems (one after instance launch, one after doing an SSH to the instance and rebooting).

The only difference between the two was the instance type. It might be worth spinning up with the second instance type to see if you see the same thing. I can run more tests when I have more time.
---

Ok. I'll try with the t3.2xlarge.
That said, the build-platform used for creating the Amazon Machine Images is `t3.2xlarge`. Further, I've successfully used the resulting images with the m6i, m7i and m7i-flex instance-families at the 2xlarge, 4xlarge and 8xlarge sizes. Though, in fairness, I almost never resize the boot-EBS. Best practice for storage is to put OS data and application-data on separate devices. As a result, any storage-needs beyond what's needed for the base OS (e.g., to increase the size of `/var/log`) I almost always meet with secondary EBSes managed either as bare filesystem devices or as secondary LVM VGs & volumes (and, when this is for hosting `/home`, I move `/home` to a secondary EBS that overlays the trivial amount set aside for `/home` in the default AMI – part of why that partition's size is paltry in our build-defaults).

At any rate, "I guess we'll see": it gives me one more thing to try out.
---

Alright. Interesting. Launched 50 of the t3.2xlarge instances using the `spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3` AMI with the previously-noted userData payload and had 11 reboot-failures. While any non-zero failure rate isn't acceptable, a 22% rate is significantly more than merely not acceptable.

I'll see what, if anything, I can do to get information from them.
---

Curiouser and curiouser…

I hadn't realized from prior communications that your reboot failures were leaving you at emergency-mode, so I hadn't included setting a root-password in my userData payload. So, I remedied that and launched a new set of 50. However, wanting to save money on the troubleshooting, I'd changed that batch to a `t3.small` instance-type and got a 0% failure-rate. So, I'm going to try a couple of other instance-types to see if the problem is specific to `t3.2xlarge` or if "some batches are luckier than others".
---

Oof… Switched back to t3.2xlarge and got 19/50 failures.

W T F
Just as a hedge against there being some kind of race-condition causing issues when executing:

```bash
growpart --free-percent=50 /dev/nvme0n1 4
lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
xfs_growfs /dev/mapper/RootVG-rootVol
lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
xfs_growfs /dev/mapper/RootVG-varVol
lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
xfs_growfs /dev/mapper/RootVG-auditVol
```

I changed it to the more-compact (and internally-enforced) workflow, in which the `-r` flag has `lvresize` grow each filesystem itself (via `fsadm`), serializing the LV-resize and filesystem-grow steps:

```bash
growpart --free-percent=50 /dev/nvme0n1 4
lvresize -r --size=+4G /dev/mapper/RootVG-rootVol
lvresize -r --size=+8G /dev/mapper/RootVG-varVol
lvresize -r --size=+6G /dev/mapper/RootVG-auditVol
```
Sadly, it made no difference in the presence of reboot failures: the batch using the changed content failed at 12/50.
---

Ok. This may be an issue with the systemd version in RHEL 8.x. When I looked up the error message coming from systemd's attempt to mount `/var/log/audit`:
```text
# systemctl --no-pager status -l var-log-audit.mount
● var-log-audit.mount - /var/log/audit
   Loaded: loaded (/etc/fstab; generated)
   Active: failed (Result: protocol) since Mon 2024-07-01 13:27:30 UTC; 32min ago
    Where: /var/log/audit
     What: /dev/mapper/RootVG-auditVol
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)

Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: Mounting /var/log/audit...
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: var-log-audit.mount: Mount process finished, but there is no mount.
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: var-log-audit.mount: Failed with result 'protocol'.
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: Failed to mount /var/log/audit.
```
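For anyone reproducing this, the matching journal detail for the failed unit can be pulled with plain `journalctl` (nothing here is specific to this case):

```bash
journalctl -b -u var-log-audit.mount --no-pager
```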
I was able to turn up an issue filed in late 2018 against the `systemd` project. That issue mentioned:

> systemd version the issue has been seen with 239, 238, 237
And, when I check the failed EC2s, I see:

```text
# systemctl --version
systemd 239 (239-82.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
```
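If you want to script that check across a batch of hosts, a minimal sketch (the `INSTANCE_IPS` array and `maintuser` SSH access are assumptions carried over from the earlier loops):

```bash
# Flag hosts running a systemd release in the range the upstream
# issue reports as affected (v237-v239)
for INSTANCE_IP in "${INSTANCE_IPS[@]}"
do
   VERSION="$( ssh "maintuser@${INSTANCE_IP}" \
     "systemctl --version | awk 'NR==1{ print \$2 }'" )"
   if (( VERSION >= 237 && VERSION <= 239 ))
   then
      echo "${INSTANCE_IP}: systemd ${VERSION} - potentially affected"
   fi
done
```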
Why I didn't see this on t3s smaller than 2xlarge this morning, I don't know. I'd try larger t3 instance-types, but there aren't any. I'm going to try a couple of m-class batches in the 2xlarge sizes to see if I get it there or not.
---

Just as a sanity-check, I tried two more tests to determine whether the problem occurs only when modifying the "OS volumes" (i.e., `/`, `/var`, `/var/log`, `/var/log/audit`) or with any modification to the boot volume-group.

As noted previously, I haven't encountered issues when adding secondary EBSes to host non-OS data. Similarly, our processes around creating AMIs only test whether the boot-EBS can be grown; they make no LVM modifications and do no reboots.

At any rate, the manner in which we generally use Linux EC2s and test the AMIs we publish likely accounts for why the underlying problem hasn't previously manifested.
Going to open an issue with Red Hat.
In the interim, I would suggest not trying to grow or create further volumes within the root LVM2 volume-group. Which is to say, if you've got application-data that's been driving you to alter the boot volumes' sizes, place that data on LVM objects outside of the root volume-group.
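For concreteness, a minimal sketch of that approach, assuming a secondary EBS that shows up as `/dev/nvme1n1` and hypothetical `AppVG`/`appVol` names (none of this is from the spel tooling):

```bash
# Build a dedicated volume-group on the secondary EBS...
pvcreate /dev/nvme1n1
vgcreate AppVG /dev/nvme1n1

# ...carve out a volume spanning it and put a filesystem on it...
lvcreate --name appVol --extents 100%FREE AppVG
mkfs.xfs /dev/mapper/AppVG-appVol

# ...and mount it where the application-data lives
mkdir -p /opt/appdata
echo '/dev/mapper/AppVG-appVol /opt/appdata xfs defaults,nofail 0 0' >> /etc/fstab
mount /opt/appdata
```

Since the application volume-group is wholly outside `RootVG`, resizing or extending it never touches the boot-time mounts that were failing above.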
---

Issue opened with Red Hat. Engineers are reviewing `sos report` outputs for causes. However, the initial response has been "this looks like it should be rebooting".
---

Red Hat has moved the case from their Storage team – the ones who deal with LVM, etc. – to their Services team – the ones who oversee `systemd`-related issues. It's believed the issue is a race-condition in `systemd`'s mount handlers.
In the near term, if you're expanding volumes to host application-data, switch to hosting that data on volumes separate from the root volume-group and mount as appropriate. Otherwise, until Red Hat can identify a real fix, they recommend adding the `nofail` option to the `/var/log/audit` entry in the `/etc/fstab` file.
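For reference, a hedged sketch of what that `/etc/fstab` entry might look like with `nofail` added (the other options shown are assumptions, not necessarily the AMI's defaults):

```text
/dev/mapper/RootVG-auditVol  /var/log/audit  xfs  defaults,nofail  0 0
```

With `nofail`, boot proceeds even if this mount loses the race, rather than dropping the system to emergency-mode.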
---

Got a response back from the Service Team last night:

> Based on the findings so far, this looks to be related to a race condition in systemd 239. The upstream bug:
>
> systemd: mount units fail with "Mount process finished, but there is no mount." · Issue #10872 · systemd/systemd · GitHub
> https://github.com/systemd/systemd/issues/10872
>
> And the commit that fixes it:
>
> mount: mark an existing "mounting" unit from /proc/self/mountinfo as … · systemd/systemd@1d086a6 · GitHub
> https://github.com/systemd/systemd/commit/1d086a6e59729635396204fc05234f1d3caa0847
>
> We have a Jira open for the behavior when seen with NFS mounts, but based on the upstream bug and commit, I'm not confident that the behavior is isolated to NFS remote mounts, but more so the race in general:
>
> [RHEL-5907] NFSv3 remote mounts fail with "Mount process finished, but there is no mount." when a daemon-reload happens while mounting the remote mount - Red Hat Issue Tracker
> https://issues.redhat.com/browse/RHEL-5907
>
> In the circumstance on this case, it seems to present the same behaviors, and systemd is reloaded in parallel with the mounting of the filesystem.
Following that RHEL-5907 issue-link, it looks like this has been going on since at least September. I'm not sure why any of the AMIs have worked for you. I don't have the time to verify, but I'm going to assume that the issue is present in all of our RHEL 8 AMIs:
```text
---------------------------------------------------------------------------
|                              DescribeImages                             |
+------------------------+------------------------------------------------+
|         ImageId        |                      Name                      |
+------------------------+------------------------------------------------+
|  ami-0b86f42e4059bde4b |  spel-minimal-rhel-8-hvm-2023.12.1.x86_64-gp3  |
|  ami-0363cf6882daf4895 |  spel-minimal-rhel-8-hvm-2024.01.1.x86_64-gp3  |
|  ami-092037daccf8526f7 |  spel-minimal-rhel-8-hvm-2024.02.1.x86_64-gp3  |
|  ami-0373eef9e2b3b4bcd |  spel-minimal-rhel-8-hvm-2024.03.2.x86_64-gp3  |
|  ami-0e770985d7a4b2822 |  spel-minimal-rhel-8-hvm-2024.04.1.x86_64-gp3  |
|  ami-0455bbb8b742553ba |  spel-minimal-rhel-8-hvm-2024.05.1.x86_64-gp3  |
|  ami-021ba76fc66135488 |  spel-minimal-rhel-8-hvm-2024.06.1.x86_64-gp3  |
+------------------------+------------------------------------------------+
```
(We have AMIs older than the above; it's just that the deprecation-tags mean they won't show up in a search.)
---

This turned out to be a vendor (Red Hat) issue. Closing this case as there's (currently) nothing to be done via this project.
---

Update:

The vendor-assigned engineer finally updated their Jira associated with this problem. That engineer has decided it's a WONTFIX because Red Hat 8 is too late in its lifecycle to be worth fixing what he characterized as a "nice to have" (poor word-choice: "rare" or "corner case" would probably have been a less-loaded choice). From the vendor's Jira (RHEL-5907):

> Honestly, I don't think we should try to attempt to fix this issue at this time. RHEL-8 is at the point in its lifetime when nice-to-have fixes create unnecessary risk, specially fixes in core parts for systemd. I might reconsider if there is some business justification.
---

Creating a new ticket, as it seems like the OP for #691 may no longer be having issues (no comment since June 14th), and I don't want to continue spamming them if such is the case.

Originally posted by @mrabe142 in https://github.com/plus3it/spel/issues/691#issuecomment-2168365155.

And, per @mrabe142 in https://github.com/plus3it/spel/issues/691#issuecomment-2187415382.

And, per @mrabe142 in https://github.com/plus3it/spel/issues/691#issuecomment-2189641239.