hashicorp / packer

Packer is a tool for creating identical machine images for multiple platforms from a single source configuration.
http://www.packer.io
Other
15.05k stars 3.32k forks source link

Packer qemu builder crashes sometimes (yamux keepalive, qemu EOF) #13093

Open WorldBuild opened 3 months ago

WorldBuild commented 3 months ago

Hello, I use Packer often to build qcow2 images from other qcow2 images, and once in a blue moon, at a random time during the build, Packer exits abruptly due to something related to Yamux

I run my builds on CentOS 8, and this happens even when I have no provisioners, so I think the most likely culprit is the communicator I use (WinRM)

Simplified template

{
  "builders": [
    {
      "vm_name": "test-crashing-builds",
      "iso_url": "/packer_builds/fresh_windows_2022_server.qcow2",
      "iso_checksum": "none",
      "output_directory": "/packer_builds/out/test_server.qcow2",
      "disk_size": "300000M",
      "memory": 4096,
      "shutdown_command": "shutdown /s /t 10 /f /d p:0:0",
      "shutdown_timeout": "180m",
      "boot_wait": "0s",
      "cd_files": [],
      "floppy_files": [],
      "accelerator": "kvm",
      "boot_command": "",
      "communicator": "winrm",
      "disk_interface": "virtio",
      "format": "qcow2",
      "headless": true,
      "net_device": "virtio-net",
      "type": "qemu",
      "efi_boot": true,
      "efi_firmware_code": "/packer_builds/OVMF_CODE.fd",
      "efi_firmware_vars": "/packer_builds/OVMF_VARS.fd",
      "machine_type": "q35",
      "winrm_password": "...",
      "winrm_username": "...",
      "winrm_no_proxy": true,
      "disk_image": true,
    }
  ],
  "provisioners": [] #I use powershell and windows-restart provisioners
}

Log doesn't help much as it just shows the error message with yamux keepalive and qemu EOF

[INFO] Packer version: 1.9.4 [go1.20.7 linux amd64]
[TRACE] discovering plugins in .
[TRACE] discovering plugins in /.config/packer/plugins
[INFO] PACKER_CONFIG env var not set; checking the default config file path
[INFO] PACKER_CONFIG env var set; attempting to open config file: /.packerconfig
[WARN] Config file doesn't exist: /.packerconfig
[INFO] Setting cache directory: /.cache/packer
e: Running in background, not using a TTY
ui: Warning: Bundled plugins used

This template relies on the use of plugins bundled into the Packer binary.
The practice of bundling external plugins into Packer will be removed in an
upcoming version.

To remove this warning and ensure builds keep working you can install these
external plugins with the 'packer plugins install' command

* packer plugins install github.com/hashicorp/qemu

Alternatively, if you upgrade your templates to HCL2, you can use 'packer init'
with a 'required_plugins' block to automatically install external plugins.

You can try HCL2 by running 'packer hcl2_upgrade
/packer_builds/template.json'

[INFO] Starting internal plugin packer-builder-qemu
packer-builder-qemu plugin: [INFO] PACKER_CONFIG env var not set; checking the default config file path
packer-builder-qemu plugin: [INFO] PACKER_CONFIG env var set; attempting to open config file: /.packerconfig
packer-builder-qemu plugin: [WARN] Config file doesn't exist: /.packerconfig
packer-builder-qemu plugin: [INFO] Setting cache directory: /.cache/packer
packer-builder-qemu plugin: args: []string{"packer-builder-qemu"}
packer-builder-qemu plugin: Plugin address: unix /bin/packer/packer-plugin2759562768
packer-builder-qemu plugin: Waiting for connection...
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin2759562768
packer-builder-qemu plugin: Serving a plugin connection...
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin4152808466
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin4222281818
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin773823907
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin3442804825
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin802013099
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin390940667
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin3622088650
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin1495955599
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin1358925819
Waiting for RPC address for: /packer/1.9.4/__packer__/packerReceived unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin1737098876
Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin3800178001Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin1077889329Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin1013486710Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin2732238272Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin1290914942Waiting for RPC address for: /packer/1.9.4/__packer__/packer
Received unix RPC address for /packer/1.9.4/__packer__/packer: addr is /bin/packer/packer-plugin2072334252Preparing build: qemu
packer-builder-qemu plugin: use specified accelerator: kvm
ui: Warning: Warning when preparing build: "qemu"A checksum of 'none' was specified. Since ISO files are so big,
a checksum is highly recommended.ui: qemu: output will be in this color.
ui: 
Build debug mode: false
Force build: false
On error: cleanup
Waiting on builds to complete...
Starting build run: qemu
Running builder: qemu
packer-builder-qemu plugin: Qemu path: /bin/qemu-system-x86_64, Qemu Image path: /bin/qemu-img
ui: ==> qemu: Retrieving ISO
ui: ==> qemu: Trying /packer_builds/fresh_windows_2022_server.qcow2
packer-builder-qemu plugin: Acquiring lock for: /packer_builds/fresh_windows_2022_server.qcow2 (/.cache/packer/b09a8cfc0a89165c4e3d4c841c4c30b1550a12e0.iso.lock)
ui: ==> qemu: Trying /packer_builds/fresh_windows_2022_server.qcow2
ui: ==> qemu: /packer_builds/fresh_windows_2022_server.qcow2 => /packer_builds/fresh_windows_2022_server.qcow2
packer-builder-qemu plugin: Leaving retrieve loop for ISO
packer-builder-qemu plugin: No floppy files specified. Floppy disk will not be made.
packer-builder-qemu plugin: No CD files specified. CD will not be made.
ui: ==> qemu: Copying hard drive...
packer-builder-qemu plugin: Executing qemu-img: []string{"convert", "-O", "qcow2", "/packer_builds/fresh_windows_2022_server.qcow2", "/packer_builds/out/test_server.qcow2"}
packer-builder-qemu plugin: stdout:
packer-builder-qemu plugin: stderr:
ui: ==> qemu: Resizing hard drive...
packer-builder-qemu plugin: Executing qemu-img: []string{"resize", "-f", "qcow2", "/packer_builds/out/test_server.qcow2", "300000M"}
packer-builder-qemu plugin: stdout: Image resized.
packer-builder-qemu plugin: stderr:
packer-builder-qemu plugin: Looking for available communicator (SSH, WinRM, etc) port between 2222 and 4444
packer-builder-qemu plugin: Found available port: 3282 on IP: 127.0.0.1
ui: ==> qemu: Found port for communicator (SSH, WinRM, etc): 3282.
ui: ==> qemu: Looking for available port between 5900 and 6000 on 127.0.0.1
packer-builder-qemu plugin: Looking for available port between 5900 and 6000 on 127.0.0.1
packer-builder-qemu plugin: Found available port: 5916 on IP: 127.0.0.1
packer-builder-qemu plugin: Found available VNC port: 5916 on IP: 127.0.0.1
[ERR] yamux: keepalive failed: i/o deadline reached
ui error: Build 'qemu' errored after 2 minutes 39 seconds: unexpected EOF
ui: 
==> Wait completed after 2 minutes 39 seconds
machine readable: error-count []string{"1"}
ui error: 
==> Some builds didn't complete successfully and had errors:
machine readable: qemu,error []string{"unexpected EOF"}
ui error: --> qemu: unexpected EOF
ui: 
==> Builds finished but no artifacts were created.
github-actions[bot] commented 3 months ago
Hi 👋 thanks for reaching out.

For general questions we recommend reaching out to the [community forum](https://discuss.hashicorp.com/c/packer) for greater visibility.
As the GitHub issue tracker is only watched by a small subset of maintainers and is really reserved for bugs and enhancements, you'll have a better chance of finding someone who can help you in the forum.
We'll mark this issue as needs-reply to help inform maintainers that this question is awaiting a response.
If no activity is taken on this question within 30 days it will be automatically closed.

If you find the forum to be more helpful or if you've found the answer to your question elsewhere please feel free to post a response and close the issue.
lbajolet-hashicorp commented 2 months ago

Hi @WorldBuild,

Thanks for bubbling this up! This looks like a bug indeed, no clue if that's on us or if it's an upstream problem (yamux is the muxer we use over the wire for allowing multiple entities to communicate over one single connection), but it warrants an investigation.

Judging from the logs it seems this occurs rapidly during the course of a build, right?

Looking at your logs it seems you're running Packer 1.9.4, any chance you can try on a more recent version to see if the problem persists? I'd be surprised if it did tbh as we didn't change that piece of code for a while, but maybe something else positively interacts with it? If you update, please note that you'll need to manually install the qemu plugin so Packer can use it, you can do something as simple as packer plugins install github.com/hashicorp/qemu for this. Feel free to read our documentation on this subject too if you want more information.

WorldBuild commented 2 months ago

Hi @WorldBuild,

Thanks for bubbling this up! This looks like a bug indeed, no clue if that's on us or if it's an upstream problem (yamux is the muxer we use over the wire for allowing multiple entities to communicate over one single connection), but it warrants an investigation.

Looking at your logs it seems you're running Packer 1.9.4, any chance you can try on a more recent version to see if the problem persists? I'd be surprised if it did tbh as we didn't change that piece of code for a while, but maybe something else positively interacts with it? If you update, please note that you'll need to manually install the qemu plugin so Packer can use it, you can do something as simple as packer plugins install github.com/hashicorp/qemu for this. Feel free to read our documentation on this subject too if you want more information.

Hey @lbajolet-hashicorp! Thank you for the detailed response and help! I'll try upgrading to a newer version and test it out for a few weeks, and reportback my results.

Judging from the logs it seems this occurs rapidly during the course of a build, right?

Yep I can confirm that, it happens near instantly, I remember when researching this 2 weeks back I saw a discussion that keepalive can be configured to have a longer timeout or disabled entirely but I couldn't find the source again :/

For future reference, is there any way I can pull out more detailed logs when crashes such as this happen? So far I've been relying on PACKER_LOG but it didn't help me much in this case since I honestly still have no idea what actually crashed 😅 (if anything)

EDIT: cat walked on my keyboard

lbajolet-hashicorp commented 2 months ago

For the verbose logs unfortunately I'm not sure what more we can do :/ The error you see comes straight from yamux, so unless it can give us more context on the keepalive/deadline problem, not sure we have much more to report here.

It warrants some delving in its code base though, which will probably have to happen to understand how we end-up here, but since that code is a bit complex, and stuff occurs on different coroutines, it might be a bit hard to troubleshoot unfortunately.

Out of curiosity, do you hit this often?

As for the WinRM hypothesis, it could be I imagine? Though that seems unlikely, as yamux only manages communication with the plugin, not with the VM directly (from my understanding at least), but it could be a side-effect possibly.

Will investigate!

WorldBuild commented 2 months ago

For the verbose logs unfortunately I'm not sure what more we can do :/ The error you see comes straight from yamux, so unless it can give us more context on the keepalive/deadline problem, not sure we have much more to report here.

It warrants some delving in its code base though, which will probably have to happen to understand how we end-up here, but since that code is a bit complex, and stuff occurs on different coroutines, it might be a bit hard to troubleshoot unfortunately.

Out of curiosity, do you hit this often?

As for the WinRM hypothesis, it could be I imagine? Though that seems unlikely, as yamux only manages communication with the plugin, not with the VM directly (from my understanding at least), but it could be a side-effect possibly.

Will investigate!

Thanks for investigating, to answer your question, no I don't hit this very often, so it's not a critical issue

Digging around in the Yamux codebase, the error seems to be thrown from this code block, after Yamux times out https://github.com/hashicorp/yamux/blob/d1caa6c97c9fc1cc9e83bbe34d0603f9ff0ce8bd/session.go#L303-L309

could bumping the ConnectionWriteTimeout here help? https://github.com/hashicorp/yamux/blob/d1caa6c97c9fc1cc9e83bbe34d0603f9ff0ce8bd/mux.go#L16-L27

Ideally if we could configure this in the Packer template or via an environmental variable (like PACKER_LOG), I could test it out for you :)

lbajolet-hashicorp commented 2 months ago

Hi @WorldBuild,

Thanks for the delve into the yamux code, much appreciated! Bumping the timeout might be a fix for your case yes, I'm not sure by how much though, and for problems like this it's unfortunately not a one-value-fits-all kind of situation I'm afraid, still it's worth looking into.

If you're able to experiment with it and try out what would feel good to you please do, and if you want to add an option to the configurations in the SDK please do and open a PR if that might be helpful to others :)

Thanks!