hashicorp / packer

Packer is a tool for creating identical machine images for multiple platforms from a single source configuration.
http://www.packer.io

How to reboot during provisioning? Request for docs or feature #11190

Open robotrapta opened 2 years ago

robotrapta commented 2 years ago

Description

Howdy y'all! I need to restart my machine during provisioning. I'm new to packer coming from systems like ansible and chef. I've read a bunch of docs on this, and am still confused. So I think this is at least a doc bug, and perhaps a full feature request.

I know this issue has been discussed in https://github.com/hashicorp/packer/issues/1983 and that there was a proposal to build a native provisioner in https://github.com/hashicorp/packer/pull/4555 - a native feature makes a ton of sense to me.

I know the recommended way to do this is with a shell provisioner. This seems to rely on the retry mechanism, treating the reboot as a kinda-expected error and then getting the system to recover from it appropriately. Maybe this is useful for other reasons, but it feels like an ugly hack. A hack would be okay if it was clearly documented and worked reliably. But it's not clearly documented - there is no example that I can find showing how to do this. My first few attempts to get it to work after reading the docs were unreliable - sometimes it failed, and sometimes it re-ran things unnecessarily. So it would be great to just tell people the standard way to do this if there isn't a built-in way to do it.

I think a good way to do it is this:

  provisioner "shell" {
    expect_disconnect = true
    inline = [
      "sudo reboot now",
    ]
    pause_after  = "10s"
  }

I believe the pause_after is important to minimize the risk of a race condition when the next provisioning command is issued. That seems to me like a pretty strong argument for making this a native feature.

If that's in fact correct, putting that example code in the docs would be awesome. Thanks!

Use Case(s)

I'm trying to install nvidia drivers to use CUDA, which generally requires a reboot.

azr commented 2 years ago

Hello @robotrapta! Thanks for opening. Yeah, I agree with you that this feels like a hack. The thing is that Packer sort of expects to log in to an instance once, at the beginning, and there is no internal/native way to 'reconnect' that feels great. So we would like to introduce a new feature soon to be able to 'connect' in the middle of a build. This would allow changing SSH settings or rebooting after an installation. But that one will not come straight away, as we have quite a large and growing to-do list. Making a docs page about that would be a good idea, I think. I'll bring that one up to the team.

One thing that comes to mind here is that you could install the drivers at the end of your provisioning steps, and just shut down/save the machine. Upon next boot, things should be configured. If you have more things to install/configure, then you could, for example, start another build?

With that said, and if that does not work out, do you mind sharing your build file and your logs? Maybe we can help you better/differently from there.

keviiin38 commented 2 years ago

Hello ! I'm trying to achieve something similar:

{
    "provisioners": [
        {
            "type": "ansible",
            "playbook_file": "playbook.yml"
        },
        {
            "type": "shell",
            "inline": [ "reboot now" ],
            "expect_disconnect": true
        },
        {
            "type": "file",
            "source": "serverspec/",
            "destination": "/tmp",
            "pause_before": "30s"
        },
        {
            "type": "shell",
            "script": "serverspec.sh"
        }
    ]
}

I'm using pause_before in the next provisioner instead of the pause_after, but don't know which one is better.

robotrapta commented 2 years ago

Hi @azr thanks for the suggestion of installing the drivers at the end - unfortunately that doesn't work for me. There's a bunch of software I need to install which depends on having CUDA installed, and some of those installations will fail if it can't confirm nvidia hardware/drivers present.

ccrvlh commented 2 years ago

Hi, similar use case here, and found the results to be sort of inconsistent. My current template is something like:


  provisioner "shell" {
    pause_before = "10s"
    script = "./scripts/updates.sh"
  }

  provisioner "ansible-local" {
    role_paths      = ["./roles"]
    playbook_file   = "./roles/ubuntu/ubuntu.yml"
    command         = "sudo ansible-playbook -i localhost -e 'ansible_python_interpreter=/usr/bin/python3'"
  }

  provisioner "shell" {
    script = "./scripts/reboot.sh"
    expect_disconnect = true
  }

  provisioner "shell" {
    pause_before = "120s"
    script = "./scripts/finish.sh"
    max_retries = 3
  }

What I've found is that I can always see the message that reboot.sh prints (Rebooting to apply updates), and sometimes I see the next message (Pausing for 2 minutes before next text or something similar, can't remember). But sometimes the build just fails after the reboot message. This seems strange, since I've been following the VMs and a reboot usually takes around 30-60 seconds, and I do have the retry mechanism for the finish provisioner. Couldn't quite find a reliable way to do that. I ran about 10-20 builds today and it is completely hit or miss.

I understand the solution proposed by @azr but, just like @robotrapta, I also wanted to perform actions after the reboot. There may be a workaround, sure, but the reboot is just the natural way to go for us.

The impression I get is that Packer "crashes" (probably not the right wording, pardon me) after the reboot, even with expect_disconnect and doesn't understand the next task to perform (wait to reconnect).

I could try a few more tests with debug logs on maybe.

github-actions[bot] commented 1 year ago

This issue has been synced to JIRA for planning.

JIRA ID: HPR-770

spuder commented 1 year ago

A request on how to reboot came up in the discuss group https://discuss.hashicorp.com/t/how-to-reboot-vm-with-packer/46083/2

You are already using pause_before and pause_after. Try also adding ssh_read_write_timeout = "5m".

e.g.

source "amazon-ebs" "ubuntu-bionic" {
  ami_name      = "ubuntu-bionic-18.04-hvm-ebs-{{timestamp}}"
  instance_type = "t2.micro"
  region        = "us-west-2"
  source_ami_filter {
    filters = {
      name                = "ubuntu/images/*ubuntu-bionic-18.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["099720109477"]
  }
  ssh_username    = "ubuntu"
  ssh_read_write_timeout = "5m" # Allow reboots
}

brianjmurrell commented 1 year ago

Or simply because you need to reboot to a newer kernel, installed during the provisioning in order to be able to remove the one that is currently booted.

ghost commented 1 year ago

any recommended way to reboot during provisioning?

RaJiska commented 8 months ago

The documentation actually makes mention of it.

Firstly with expect_disconnect meant to be used in your provisioner rebooting your machine, and secondly with start_retry_timeout to be used in your subsequent shell provisioner.
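
For readers who want to see the two options side by side, here is a minimal sketch, assuming a Linux guest reached over SSH. The inline commands and the 10m value are illustrative placeholders, not something taken from the docs or from anyone's template in this thread, and the blocks are meant to sit inside an existing build block. Other comments in this issue report that this pattern is not always reliable, so treat it as a starting point rather than a guaranteed fix.

```hcl
# Sketch only: place these inside an existing build { ... } block.
# The commands and timeout value below are illustrative assumptions.

# Step 1: trigger the reboot and tolerate the dropped connection.
provisioner "shell" {
  expect_disconnect = true
  inline            = ["sudo reboot"]
}

# Step 2: first provisioner after the reboot. Packer keeps trying to
# start the remote process until start_retry_timeout elapses
# (the documented default is 5m), which gives SSH time to come back.
provisioner "shell" {
  start_retry_timeout = "10m"
  inline              = ["uptime"]
}
```

Raising start_retry_timeout only makes sense if the reboot can take longer than the default; for fast-booting images the default may already be enough.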

tenthirtyam commented 1 month ago

Agree that this is already available with use of pause_before, expect_disconnect and start_retry_timeout in the shell provisioner.

pause_before (duration) - Sleep for duration before execution.

expect_disconnect (boolean) - Defaults to false. When true, allow the server to disconnect from Packer without throwing an error. A disconnect might happen if you restart the SSH server or reboot the host.

start_retry_timeout (string) - The amount of time to attempt to start the remote process. By default this is 5m or 5 minutes. This setting exists in order to deal with times when SSH may restart, such as a system reboot. Set this to a higher value if reboots take a longer amount of time.

Additionally, this is available for Windows with the windows-restart provisioner.

provisioner "windows-restart" {
  pause_before          = "30s"
  restart_check_command = "powershell -command \"& {Write-Output 'restarted.'}\""
  restart_timeout       = "10m"
}

cc @nywilken @lbajolet-hashicorp

lbajolet-hashicorp commented 1 month ago

Thanks for the update here @tenthirtyam,

This is an old issue, we do have documentation on those options, but maybe the workflow isn't intuitive, or the documentation is lacking.

That said, this hasn't been updated for a while, and most of the updates seem to point to community resources or share examples of how the problem was fixed or circumvented.

I'm tempted to close this issue now, but I'd like to hear from others that commented on this issue before: are you still experiencing the problem? Do you have suggestions on how we can improve Packer or the docs that would've helped you solve that issue?

jaysoffian commented 1 month ago

None of the suggestions work reliably, at least, not in combination with the AWS session-manager-plugin. I've tried adding pause_after to the step that reboots, pause_before to the step following the reboot. I've tried adding an interim shell-local step. I've tried adding retries. I've tried setting ssh_read_write_timeout to something low like 1m, but that timeout doesn't seem to apply until after the connection has started.

What seems to be happening is that the session-manager-plugin itself does not reliably notice the remote end has gone away and we cannot adjust its timeout which is apparently 1 hour. Once it's finally timed out it's too late.

Here's an example of what it looks like when it works:

```
1 2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00: amazon-ebs.builder: + shutdown -r +1s
2 2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00: amazon-ebs.builder: shutdown: [pid 1018]
3 2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00: amazon-ebs.builder: Shutdown at Fri Jun 21 18:45:36 2024.
4 2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00: amazon-ebs.builder: shutdown: can't detach from console
5 2024/06/21 14:45:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:45:36 [INFO] RPC endpoint: Communicator ended with: 0
6 2024/06/21 14:45:36 [INFO] 1639 bytes written for 'stdout'
7 2024/06/21 14:45:36 [INFO] 0 bytes written for 'stderr'
8 2024/06/21 14:45:36 [INFO] RPC client: Communicator ended with: 0
9 2024/06/21 14:45:36 [INFO] RPC endpoint: Communicator ended with: 0
10 2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00: amazon-ebs.builder: Shutdown at Fri Jun 21 18:45:36 2024.
11 2024/06/21 14:45:36 packer-provisioner-shell plugin: [INFO] 1639 bytes written for 'stdout'
12 2024/06/21 14:45:36 packer-provisioner-shell plugin: [INFO] 0 bytes written for 'stderr'
13 2024/06/21 14:45:36 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
14 2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00: amazon-ebs.builder:
15 2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00: amazon-ebs.builder: System shutdown time has arrived
16 2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00: ==> amazon-ebs.builder: Pausing 1m0s after this provisioner...
17 2024/06/21 14:46:25 ui: 2024-06-21T14:46:25-04:00: amazon-ebs.builder: SessionId: user.XXXXXXXX : document process failed unexpectedly: document worker timed out , check [ssm-document-worker]/[ssm-session-worker] log for crash reason
18 2024/06/21 14:46:25 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:25 ssm: Starting PortForwarding session to instance i-XXXXXXXX
19 2024/06/21 14:46:25 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:25 ssm: Terminating PortForwarding session "user.XXXXXXXX"
20 2024/06/21 14:46:26 ui: 2024-06-21T14:46:26-04:00: amazon-ebs.builder: Starting portForwarding session "user.XXXXXXXX".
21 2024/06/21 14:46:26 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:26 Executing: /opt/aws/sessionmanagerplugin/bin/session-manager-plugin [{"SessionId":"XXXXXXXX","StreamUrl":"wss://XXXXXXXX"} XXXXXXXX StartSession {"DocumentName":"AWS-StartPortForwardingSession","Parameters":{"localPortNumber":["8920"],"portNumber":["22"]},"Reason":null,"Target":"i-XXXXXXXX"} wss://XXXXXXXX]
22 2024/06/21 14:46:26 ui: 2024-06-21T14:46:26-04:00: amazon-ebs.builder: Starting session with SessionId: user.XXXXXXXX
23 2024/06/21 14:46:33 ui: 2024-06-21T14:46:33-04:00: amazon-ebs.builder: Port 8920 opened for sessionId user.XXXXXXXX.
24 2024/06/21 14:46:33 ui: 2024-06-21T14:46:33-04:00: amazon-ebs.builder: Waiting for connections...
25 2024/06/21 14:46:36 [INFO] (telemetry) ending shell
26 2024/06/21 14:46:36 [INFO] (telemetry) Starting provisioner shell
27 2024/06/21 14:46:36 ui: 2024-06-21T14:46:36-04:00: ==> amazon-ebs.builder: Provisioning with shell script: ./stage2.setup_ami.sh
28 2024/06/21 14:46:36 packer-provisioner-shell plugin: Opening ./stage2.setup_ami.sh for reading
29 2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [DEBUG] Opening new ssh session
30 2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [ERROR] ssh session open error: 'EOF', attempting reconnect
31 2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [DEBUG] reconnecting to TCP connection for SSH
32 2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [DEBUG] handshaking with SSH
33 2024/06/21 14:46:36 ui: 2024-06-21T14:46:36-04:00: amazon-ebs.builder: Connection accepted for session [user.XXXXXXXX]
34 2024/06/21 14:46:36 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
35 2024/06/21 14:46:36 [INFO] 18451 bytes written for 'uploadData'
36 2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [DEBUG] handshake complete!
37 2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [DEBUG] Opening new ssh session
38 2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [INFO] agent forwarding enabled
39 2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [DEBUG] Starting remote scp process: scp -vt /tmp
40 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Started SCP session, beginning transfers...
41 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Copying input data into temporary file so we can read the length
42 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] scp: Uploading script_5175.sh: perms=C0644 size=18451
43 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] SCP session complete, closing stdin pipe.
44 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Waiting for SSH session to complete.
45 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] scp stderr (length 72): Sink: C0644 18451 script_5175.sh
46 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: scp: debug1: fd 0 clearing O_NONBLOCK
47 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Opening new ssh session
48 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] starting remote command: chmod 0755 /tmp/script_5175.sh
49 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [INFO] RPC endpoint: Communicator ended with: 0
50 2024/06/21 14:46:38 [INFO] RPC client: Communicator ended with: 0
51 2024/06/21 14:46:38 [INFO] RPC endpoint: Communicator ended with: 0
52 2024/06/21 14:46:38 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
53 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Opening new ssh session
54 2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] starting remote command: sudo /tmp/script_5175.sh setup2
55 2024/06/21 14:46:38 ui: 2024-06-21T14:46:38-04:00: amazon-ebs.builder: + uptime
```

The important line to notice is the aws session-manager-plugin disconnecting on line 17.

Now here's the same configuration failing:

```
1 2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00: amazon-ebs.builder: + shutdown -r +1s
2 2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00: amazon-ebs.builder: shutdown: [pid 1005]
3 2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00: amazon-ebs.builder: Shutdown at Fri Jun 21 19:21:01 2024.
4 2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00: amazon-ebs.builder: shutdown: can't detach from console
5 2024/06/21 15:21:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:21:01 [INFO] RPC endpoint: Communicator ended with: 0
6 2024/06/21 15:21:01 [INFO] 0 bytes written for 'stderr'
7 2024/06/21 15:21:01 [INFO] 1639 bytes written for 'stdout'
8 2024/06/21 15:21:01 [INFO] RPC client: Communicator ended with: 0
9 2024/06/21 15:21:01 [INFO] RPC endpoint: Communicator ended with: 0
10 2024/06/21 15:21:01 packer-provisioner-shell plugin: [INFO] 1639 bytes written for 'stdout'
11 2024/06/21 15:21:01 packer-provisioner-shell plugin: [INFO] 0 bytes written for 'stderr'
12 2024/06/21 15:21:01 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
13 2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00: amazon-ebs.builder: Shutdown at Fri Jun 21 19:21:01 2024.
14 2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00: amazon-ebs.builder:
15 2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00: amazon-ebs.builder: System shutdown time has arrived
16 2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00: ==> amazon-ebs.builder: Pausing 1m0s after this provisioner...
17 2024/06/21 15:22:01 [INFO] (telemetry) ending shell
18 2024/06/21 15:22:01 [INFO] (telemetry) Starting provisioner shell
19 2024/06/21 15:22:01 ui: 2024-06-21T15:22:01-04:00: ==> amazon-ebs.builder: Provisioning with shell script: ./stage2.setup_ami.sh
20 2024/06/21 15:22:01 packer-provisioner-shell plugin: Opening ./stage2.setup_ami.sh for reading
21 2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [DEBUG] Opening new ssh session
22 2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [ERROR] ssh session open error: 'EOF', attempting reconnect
23 2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [DEBUG] reconnecting to TCP connection for SSH
24 2024/06/21 15:22:01 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
25 2024/06/21 15:22:01 [INFO] 18451 bytes written for 'uploadData'
26 2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [DEBUG] handshaking with SSH
27 2024/06/21 15:23:01 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
28 2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [DEBUG] Opening new ssh session
29 2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [ERROR] ssh session open error: 'client not available', attempting reconnect
30 2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [DEBUG] reconnecting to TCP connection for SSH
31 2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [DEBUG] handshaking with SSH
32 2024/06/21 15:23:03 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
33 2024/06/21 15:23:03 [INFO] 18451 bytes written for 'uploadData'
34 2024/06/21 15:24:03 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
35 2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [DEBUG] Opening new ssh session
36 2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [ERROR] ssh session open error: 'client not available', attempting reconnect
37 2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [DEBUG] reconnecting to TCP connection for SSH
38 2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [DEBUG] handshaking with SSH
39 2024/06/21 15:24:05 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
40 2024/06/21 15:24:05 [INFO] 18451 bytes written for 'uploadData'
41 2024/06/21 15:25:05 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
42 2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [DEBUG] Opening new ssh session
43 2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [ERROR] ssh session open error: 'client not available', attempting reconnect
44 2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [DEBUG] reconnecting to TCP connection for SSH
45 2024/06/21 15:25:07 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
46 2024/06/21 15:25:07 [INFO] 18451 bytes written for 'uploadData'
47 2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [DEBUG] handshaking with SSH
48 2024/06/21 15:26:07 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
49 2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [DEBUG] Opening new ssh session
50 2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [ERROR] ssh session open error: 'client not available', attempting reconnect
51 2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [DEBUG] reconnecting to TCP connection for SSH
52 2024/06/21 15:26:09 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
53 2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [DEBUG] handshaking with SSH
54 2024/06/21 15:26:09 [INFO] 18451 bytes written for 'uploadData'
55 2024/06/21 15:27:09 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
56 2024/06/21 15:27:09 [INFO] (telemetry) ending shell
57 2024/06/21 15:27:09 ui error: 2024-06-21T15:27:09-04:00: ==> amazon-ebs.builder: Error uploading script: Timeout during SSH handshake
58 2024/06/21 15:27:09 ui: 2024-06-21T15:27:09-04:00: ==> amazon-ebs.builder: Step "StepProvision" failed
59 2024/06/21 15:27:09 ui: ask: ==> amazon-ebs.builder: [c] Clean up and exit, [a] abort without cleanup, or [r] retry step (build may fail even if retry succeeds)?
60 2024/06/21 15:31:33 ui: 2024-06-21T15:31:33-04:00: ==> amazon-ebs.builder: Provisioning step had errors: Running the cleanup provisioner, if present...
61 2024/06/21 15:31:33 ui: 2024-06-21T15:31:33-04:00: ==> amazon-ebs.builder: Terminating the source AWS instance...
62 2024/06/21 15:31:33 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:31:33 ssm: Terminating PortForwarding session "user.XXXXXXXX"
63 2024/06/21 15:40:24 ui: 2024-06-21T15:40:24-04:00: ==> amazon-ebs.builder: Cleaning up any extra volumes...
64 2024/06/21 15:40:25 ui: 2024-06-21T15:40:25-04:00: ==> amazon-ebs.builder: No volumes to clean up, skipping
65 2024/06/21 15:40:25 ui: 2024-06-21T15:40:25-04:00: ==> amazon-ebs.builder: Deleting temporary security group...
66 2024/06/21 15:40:26 [INFO] (telemetry) ending amazon-ebs.builder
67 2024/06/21 15:40:26 ui error: 2024-06-21T15:40:26-04:00: Build 'amazon-ebs.builder' errored after 28 minutes 18 seconds: Error uploading script: Timeout during SSH handshake
68 2024/06/21 15:40:26 ui:
69 ==> Wait completed after 28 minutes 18 seconds
70 2024/06/21 15:40:26 machine readable: error-count []string{"1"}
71 2024/06/21 15:40:26 ui error:
72 ==> Some builds didn't complete successfully and had errors:
73 2024/06/21 15:40:26 machine readable: amazon-ebs.builder,error []string{"Error uploading script: Timeout during SSH handshake"}
74 2024/06/21 15:40:26 ui error: --> amazon-ebs.builder: Error uploading script: Timeout during SSH handshake
75 2024/06/21 15:40:26 ui:
76 ==> Builds finished but no artifacts were created.
```

I was using the following settings both times:

  ssh_read_write_timeout = "3m"

  provisioner "shell" {
    script            = "./stage2.setup_ami.sh"
    execute_command   = "sudo {{ .Path }} reboot"
    expect_disconnect = true
    skip_clean        = true
    pause_after       = "1m"
  }

  provisioner "shell" {
    script            = "./stage2.setup_ami.sh"
    execute_command   = "sudo {{ .Path }} setup2"
    max_retries       = 5
  }

I think what we need is perhaps a new option force_disconnect instead of expect_disconnect. At least in the case of using the aws session-manager-plugin.

Edit to add: This works reliably for me with the aws session-manager-plugin:

  provisioner "shell" {
    script            = "./stage2.setup_ami.sh"
    execute_command   = "sudo {{ .Path }} reboot"
    expect_disconnect = true
    skip_clean        = true
  }

  # Force kill the session-manager-plugin since it doesn't always notice the
  # remote end going away. Packer will restart it. This seems to be the only
  # reliable way to handle reboots.
  provisioner "shell-local" {
    inline = ["pkill -g 0 session-manager-plugin"]
  }

  provisioner "shell" {
    pause_before = "10s"
    inline       = ["uptime"]
    max_retries  = 10
  }

Output looks like:

```
2024-06-21T17:18:44-04:00: amazon-ebs.builder: shutdown: [pid 1007]
2024-06-21T17:18:44-04:00: amazon-ebs.builder: Shutdown at Fri Jun 21 21:18:45 2024.
2024-06-21T17:18:44-04:00: amazon-ebs.builder: shutdown: can't detach from console
2024-06-21T17:18:45-04:00: amazon-ebs.builder: Shutdown at Fri Jun 21 21:18:45 2024.
2024-06-21T17:18:45-04:00: amazon-ebs.builder:
2024-06-21T17:18:45-04:00: amazon-ebs.builder: System shutdown time has arrived
2024-06-21T17:18:46-04:00: ==> amazon-ebs.builder: Running local shell script: /tmp/packer-shell1139810214
2024-06-21T17:18:46-04:00: ==> amazon-ebs.builder: Pausing 10s before the next provisioner...
2024-06-21T17:18:46-04:00: ==> amazon-ebs.builder: Bad exit status: -1
2024-06-21T17:18:56-04:00: ==> amazon-ebs.builder: Provisioning with shell script: /tmp/packer-shell2636857501
2024-06-21T17:19:42-04:00: amazon-ebs.builder: Starting portForwarding session "user.XXXXXXXX".
2024-06-21T17:19:42-04:00: amazon-ebs.builder: Starting session with SessionId: user.XXXXXXXX
2024-06-21T17:19:46-04:00: amazon-ebs.builder: Port 8116 opened for sessionId user.XXXXXXXX.
2024-06-21T17:19:46-04:00: amazon-ebs.builder: Waiting for connections...
2024-06-21T17:19:48-04:00: amazon-ebs.builder: Connection accepted for session [user.XXXXXXXX]
2024-06-21T17:19:52-04:00: amazon-ebs.builder: 21:19 up 48 secs, 1 user, load averages: 5.13 1.37 0.51
```

That's obviously a hack. The amazon plugin specifically needs better SSM handling, but more generally, what I think is needed is a way to tell Packer that the remote machine is definitely going away between steps and that it should do whatever it takes to drop and re-establish the connection.