hashicorp / packer

Packer is a tool for creating identical machine images for multiple platforms from a single source configuration.
http://www.packer.io

Shell provisioner random script failures #12908

Open Stromweld opened 5 months ago

Stromweld commented 5 months ago

Overview of the Issue

When using the shell provisioner with an array of scripts, Packer randomly errors when it tries to run one of them.
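
For reference, the provisioner is configured roughly like this (a sketch only; the full file list lives in bento's packer_templates, and the two relative paths below are just the scripts mentioned in this issue):

provisioner "shell" {
  # each script is uploaded to /tmp on the guest and executed over SSH, one after the other
  scripts = [
    "scripts/_common/motd.sh",
    "scripts/fedora/build-tools_fedora.sh",
  ]
}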

Reproduction Steps

Run the bento builds from https://github.com/chef/bento:

git clone https://github.com/chef/bento
cd bento
packer init -upgrade ./packer_templates
packer build -only=virtualbox-iso.vm -var-file=os_pkrvars/fedora/fedora-39-x86_64.pkrvars.hcl ./packer_templates

Packer version

v1.10.2

Operating system and Environment details

Ubuntu 22.04 github actions runner. https://github.com/chef/bento/actions/runs/8601528699/job/23569023182

Log Fragments and crash.log files

2024-04-08T14:15:51Z: ==> virtualbox-ovf.vm: Provisioning with shell script: ../../packer_templates/scripts/_common/motd.sh
2024-04-08T14:16:28Z: ==> virtualbox-ovf.vm: sh: /tmp/script_2235.sh: No such file or directory
2024-04-08T14:16:28Z: ==> virtualbox-ovf.vm: Provisioning step had errors: Running the cleanup provisioner, if present...
2024-04-08T14:16:28Z: ==> virtualbox-ovf.vm: Cleaning up floppy disk...
2024-04-08T14:16:28Z: ==> virtualbox-ovf.vm: Deregistering and deleting imported VM...
2024-04-08T14:16:29Z: ==> virtualbox-ovf.vm: Deleting output directory...
2024-04-08T14:16:29Z: Build 'virtualbox-ovf.vm' errored after 1 minute 33 seconds: Script exited with non-zero exit status: 1. Allowed exit codes are: [0]

lbajolet-hashicorp commented 5 months ago

Hi @Stromweld,

Thanks for the bug report! This does indeed look like a Packer problem: the shell provisioner creates temporary scripts that we copy to the target's /tmp directory before executing them over SSH.

If you're able to reliably reproduce this error, is there a chance you could run the build with --debug or something like --on-error=ask (or abort, whichever you prefer)? That way you'll be able to SSH into the VM. I'm interested in knowing whether the shell script was actually copied, whether it was truncated for whatever reason, and whether it was copied to the right place. It might also be interesting to look at the mounts at the same time; I can't rule out /tmp being shadowed by another mount.
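
For example, roughly along these lines (a sketch; the script name under /tmp differs per run, /tmp/script_2235.sh is just the one from the log above):

# keep the VM alive for inspection when a provisioner fails
packer build -on-error=ask -only=virtualbox-iso.vm -var-file=os_pkrvars/fedora/fedora-39-x86_64.pkrvars.hcl ./packer_templates

# then, from a shell inside the VM:
ls -l /tmp/script_*.sh   # was the script uploaded, and does its size look right?
findmnt /tmp             # is /tmp shadowed by another mount (e.g. a freshly mounted tmpfs)?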

Let me know what you're able to figure out, and if you need a hand, don't hesitate to ask!

Stromweld commented 5 months ago

It seems to happen most often on my bento Fedora builds when it gets to the build-tools_fedora.sh script. I updated the reproduction steps for it. I'll give that a try and see if I can find more information.

Stromweld commented 5 months ago

I've also seen some random behavior where a script is executed and I see the script's contents printed in red, but the stream doesn't show any further details and the commands don't appear to actually run. This happens most often on the Fedora 14 build when it gets to the vagrant script that installs the vagrant user's insecure key: the wget command shows no output, and Vagrant isn't able to SSH into the resulting box.

Stromweld commented 5 months ago

@lbajolet-hashicorp I was able to replicate it with PACKER_LOG=1 set to get debug output. https://github.com/chef/bento/actions/runs/8649779119/job/23716839452
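
For anyone reproducing this locally, the same debug output can be captured by setting Packer's standard logging variables before the build, roughly like this (PACKER_LOG_PATH is optional, and the file name is just an example):

export PACKER_LOG=1
export PACKER_LOG_PATH=packer-debug.log
packer build -only=virtualbox-iso.vm -var-file=os_pkrvars/fedora/fedora-39-x86_64.pkrvars.hcl ./packer_templates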

lbajolet-hashicorp commented 3 months ago

Hey @Stromweld,

Looking back at this now: thanks for replicating and sharing the logs, but on their own they won't help us troubleshoot the problem. The logs are verbose, but not enough to understand what the root cause of the issue is, especially if this is some random occurrence.

Would you be able to produce a minimal template that we can run locally on a hypervisor? It can be extracted from bento, no problem with that, but ideally I'd like something that doesn't have too many dependencies/local files, so we can iterate on this efficiently.
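
If it helps to narrow down where to start, even something as small as this would let us exercise the shell provisioner's upload-and-execute path (a rough, untested sketch using the built-in null source against an already-running VM; the host, credentials, and script paths are placeholders, and it obviously doesn't cover the VirtualBox side of things):

source "null" "repro" {
  ssh_host     = "192.0.2.10"   # placeholder: address of an already-running test VM
  ssh_username = "vagrant"      # placeholder credentials
  ssh_password = "vagrant"
}

build {
  sources = ["source.null.repro"]

  provisioner "shell" {
    # same pattern as bento: several scripts uploaded and run in sequence
    scripts = ["scripts/one.sh", "scripts/two.sh", "scripts/three.sh"]
  }
}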

I have a hunch that it's probably our scp that failed to write the file in place, but it's hard to say what exactly the problem is without a live VM to debug on; with one, we could connect to it and look at the filesystem after each step. The chmod command does seem to fail though, and I'm surprised that doesn't mean the end of the process at that point. The exit code is non-zero, but I'm not sure there's a specific meaning to it other than "it failed".

All in all, I need your help with this one: if you are able to reliably replicate the problem and share a configuration we can run to troubleshoot it, we can look into it. Otherwise it will be exceedingly hard for us to investigate, and we cannot prioritise this in its current state.

Thanks in advance, and apologies I didn't come back to you earlier!

Stromweld commented 3 months ago

I can try; it appears to be random if and when it fails. The one that seems to fail the most is the Parallels build when installing the guest tools: if dependencies are installed first via apt, dnf, etc., it seems to do that and then skip the rest of the script. In that case the build succeeded, but testing fails because the prl_fs driver isn't available. I'll try to put something together tomorrow.

lbajolet-hashicorp commented 3 months ago

This would be awesome, many thanks @Stromweld!