hashicorp / packer

Packer is a tool for creating identical machine images for multiple platforms from a single source configuration.
http://www.packer.io
Other
15.04k stars 3.32k forks source link

Packer Build Fails During Package Updates on EC2 (Successful on Local Machine) #13141

Open nuwan96 opened 1 month ago

nuwan96 commented 1 month ago

I'm facing an issue where my Packer build process on an EC2 instance fails intermittently during specific dnf package update steps. This occurs even before all defined packages are updated. The issue does not occur when running Packer on my local machine.

Setup:

Packer Instance: An EC2 instance is used to run the Packer build process. This instance can be replaced with my local machine, where the build works without issues.

Build Instance: A separate EC2 instance is spun up to run the actual Packer image creation process. The dnf updates occur on this instance.

> Important Note: The build process works successfully when executed on my local machine (Linux or Windows or MAC). The issue only arises when running packer Build command on an EC2 instance.

Problem: The Packer build process on EC2 appears to skip ahead to the next step in my provisioner script without completing all the dnf package updates. This behavior is inconsistent and happens before all the specified packages in my script are installed or updated.

Error Messages:

amazon-ebs.alma9: (228/535): low-memory-monitor-2.1-4.el9.x86_64. 3.4 MB/s | 35 kB 00:00 amazon-ebs.alma9: (229/535): lua-srpm-macros-1-6.el9.noarch.rpm 786 kB/s | 8.4 kB 00:00 amazon-ebs.alma9: (230/535): m17n-lib-1.8.0-13.el9.x86_64.rpm 14 MB/s | 194 kB 00:00 amazon-ebs.alma9: (231/535): m17n-db-1.8.0-16.el9.noarch.rpm 18 MB/s | 578 kB 00:00 amazon-ebs.alma9: (232/535): m4-1.4.19-1.el9.x8664.rpm 18 MB/s | 294 kB 00:00 amazon-ebs.alma9: (233/535): mesa-filesystem-23.3.3-1.el9.alma.1. 1.0 MB/s | 9.9 kB 00:00 amazon-ebs.alma9: (234/535): mesa-libEGL-23.3.3-1.el9.alma.1.x86 9.6 MB/s | 126 kB 00:00 amazon-ebs.alma9: (235/535): mesa-libGL-23.3.3-1.el9.alma.1.x866 14 MB/s | 167 kB 00:00 amazon-ebs.alma9: (236/535): mesa-libgbm-23.3.3-1.el9.alma.1.x86 3.6 MB/s | 37 kB 00:00 amazon-ebs.alma9: (237/535): mesa-libglapi-23.3.3-1.el9.alma.1.x8 4.1 MB/s | 43 kB 00:00 amazon-ebs.alma9: (238/535): llvm-libs-17.0.6-5.el9.x86_64.rpm 105 MB/s | 25 MB 00:00 ==> amazon-ebs.alma9: Provisioning with shell script: bench-packer-shared/scripts/worker/dependencies.sh

Expected Behavior:

I expect Packer to complete all dnf package updates specified in my provisioner script before moving on to subsequent steps. The build should not skip ahead prematurely.

Additional Information:

Packer Version: 1.9.0 OS/AMI: Alama Linuz 9 , Amazon Linux

Additional Notes:

This issue seems similar to a previously reported problem (

https://github.com/hashicorp/packer/issues/3920). However, the proposed solutions in that issue haven't fully resolved the problem in my case.

lbajolet-hashicorp commented 1 month ago

Hi @nuwan96,

This might be a tricky one if intermittent, and we've already had reports of package managers terminating before they're completely done for some reason, and so far we've not been able to determine what was the cause of this behaviour. Chances are yum/dnf ends prematurely for some reason (possibly OOM? Though that would make the command return non-zero, so likely something else), without returning a non-zero code, which doesn't prompt Packer to stop executing.

In order to help us understand the problem, would you be able to share a (ideally minimal) template that we can run tests with? If there are any, please include the scripts used for building the image. If possible also, would you be able to share detailed logs (with PACKER_LOG=1 enabled) for the problem, so we can inspect them?

Thanks!

lbajolet-hashicorp commented 1 month ago

Note: not transferring to the AWS plugin for now as it could be linked to SSH instead, so depending on the cause will transfer later as appropriate.

revarcline commented 2 weeks ago

I'm seeing a similar issue using the aws provisioner.

I have several dnf scripts that run fairy early in the build, after a first round of updates and a reboot. The first one seems to fail without an error after dnf finsishes the downloads, but before the install transactions begin. I broke what was one script originally up into smaller chunks to see if it would resolve this issue and while the build works locally, builds via my ci pipeline (docker container managed by gitlab-runner) consistently fail on the first script.

CI runner is on the Alma Linux 9 image and running packer version 1.11.2-1 from the hashicorp repo, amazon plugin version: github.com/hashicorp/amazon v1.3.2

Here's the output from PACKER_LOG=1, starting from the end of my first dnf script's sudden end going into the beginning of my second dnf script with some private information removed. The first dnf script, which cuts off, occurs after a reboot and a series of operations to copy files into /tmp.

See below:

    amazon-ebs.alma9: --------------------------------------------------------------------------------
    amazon-ebs.alma9: Total                                            98 MB/s | 300 MB     00:03
2024/08/23 18:13:34 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:34 [ERROR] Remote command exited without exit status or exit signal.
2024/08/23 18:13:34 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:34 [INFO] RPC endpoint: Communicator ended with: 2300218
2024/08/23 18:13:34 [INFO] 69193 bytes written for 'stdout'
2024/08/23 18:13:34 [INFO] 201 bytes written for 'stderr'
2024/08/23 18:13:34 [INFO] RPC client: Communicator ended with: 2300218
2024/08/23 18:13:34 [INFO] RPC endpoint: Communicator ended with: 2300218
2024/08/23 18:13:34 packer-provisioner-shell plugin: [INFO] 69193 bytes written for 'stdout'
2024/08/23 18:13:34 packer-provisioner-shell plugin: [INFO] 201 bytes written for 'stderr'
2024/08/23 18:13:34 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 2300218
2024/08/23 18:13:34 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:34 [DEBUG] Opening new ssh session
2024/08/23 18:13:34 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:34 [ERROR] ssh session open error: 'EOF', attempting reconnect
2024/08/23 18:13:34 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:34 [DEBUG] reconnecting to TCP connection for SSH
2024/08/23 18:13:34 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:34 [ERROR] reconnection error: dial tcp $REDACTED_PRIVATE_IP:22: connect: connection refused
2024/08/23 18:13:34 packer-provisioner-shell plugin: Retryable error: Error removing temporary script at /tmp/script_902.sh: dial tcp $REDACTED_PRIVATE_IP:22: connect: connection refused
2024/08/23 18:13:36 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:36 [DEBUG] Opening new ssh session
2024/08/23 18:13:36 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:36 [ERROR] ssh session open error: 'client not available', attempting reconnect
2024/08/23 18:13:36 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:36 [DEBUG] reconnecting to TCP connection for SSH
2024/08/23 18:13:43 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:43 [ERROR] reconnection error: dial tcp $REDACTED_PRIVATE_IP:22: connect: connection refused
2024/08/23 18:13:43 packer-provisioner-shell plugin: Retryable error: Error removing temporary script at /tmp/script_902.sh: dial tcp $REDACTED_PRIVATE_IP:22: connect: connection refused
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Opening new ssh session
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [ERROR] ssh session open error: 'client not available', attempting reconnect
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] reconnecting to TCP connection for SSH
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] handshaking with SSH
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] handshake complete!
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [INFO] no local agent socket, will not connect agent
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] starting remote command: rm -f /tmp/script_902.sh
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [INFO] RPC endpoint: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] RPC client: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] RPC endpoint: Communicator ended with: 0
2024/08/23 18:13:45 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Opening new ssh session
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] starting remote command: rm -f
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [INFO] RPC endpoint: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] RPC client: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] RPC endpoint: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] (telemetry) ending shell
2024/08/23 18:13:45 [INFO] (telemetry) Starting provisioner shell
2024/08/23 18:13:45 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
==> amazon-ebs.alma9: Provisioning with shell script: bench-packer-shared/scripts/worker/dnf2.sh
2024/08/23 18:13:45 packer-provisioner-shell plugin: Opening bench-packer-shared/scripts/worker/dnf2.sh for reading
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Opening new ssh session
2024/08/23 18:13:45 packer-provisioner-shell plugin: [INFO] 385 bytes written for 'uploadData'
2024/08/23 18:13:45 [INFO] 385 bytes written for 'uploadData'
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Starting remote scp process:  scp -vt /tmp
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Started SCP session, beginning transfers...
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Copying input data into temporary file so we can read the length
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] scp: Uploading script_7122.sh: perms=C0644 size=385
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] SCP session complete, closing stdin pipe.
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Waiting for SSH session to complete.
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] scp stderr (length 65): Sink: C0644 385 script_7122.sh
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: debug1: fd 0 clearing O_NONBLOCK
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Opening new ssh session
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] starting remote command: chmod 0755 /tmp/script_7122.sh
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [INFO] RPC endpoint: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] RPC client: Communicator ended with: 0
2024/08/23 18:13:45 [INFO] RPC endpoint: Communicator ended with: 0
2024/08/23 18:13:45 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] Opening new ssh session
2024/08/23 18:13:45 packer-plugin-amazon_v1.3.2_x5.0_linux_amd64 plugin: 2024/08/23 18:13:45 [DEBUG] starting remote command: sudo -E sh -c 'BASE_IMAGE_DISTRO='AlmaLinux' CLIENT='cw' IMAGE_ARCH='x86' PACKER_BUILDER_TYPE='amazon-ebs' PACKER_BUILD_NAME='alma9' PLATFORM='aws'  /tmp/script_7122.sh'
==> amazon-ebs.alma9: + dnf install -y [redacted]

Again, I don't see this behavior when building from a local host, but it's consistently cutting off at this same point in the docker runner.

Worth noting that the shell provisioner uses these settings, as do the other working shell provisioners that successfully run dnf commands afterwards:

    expect_disconnect = true
    execute_command   = "sudo -E sh -c '{{ .Vars }} {{ .Path }}'"
revarcline commented 2 weeks ago

Figured out a dumb workaround, for what it's worth: I'm simply rerunning the same script afterwards, and it seems to work. (it has no effect if it works the first time)