hashicorp / packer-plugin-amazon

Packer plugin for Amazon AMI Builder
https://www.packer.io/docs/builders/amazon
Mozilla Public License 2.0

Session Manager connection fail during reboot #289

Closed henti closed 1 year ago

henti commented 2 years ago

Overview of the Issue

I'm attempting to build an AMI with the Amazon EBS builder, and the build involves a reboot of the mac.metal instance. The shell provisioner that performs the reboot sets expect_disconnect to true, and the provisioner that follows it uses pause_before to wait for the instance to come back.

Unfortunately, this fails with "Connection to destination port failed, check SSM Agent logs." when the instance reboots and the SSM agent is no longer connected.

I had hoped that having "expect_disconnect": true in the step above would trigger SSM re-connection attempts until pause_before is reached, and that the build would fail only then.

Reproduction Steps

Create instance on private network using SSM to connect.

Plugin and Packer version

From packer version

Packer v1.8.4

Simplified Packer Buildfile

        {
            "type": "shell",
            "execute_command": "chmod +x {{ .Path }}; sudo {{ .Vars }} {{ .Path }}",
            "script": "./reboot.sh",
            "expect_disconnect": true
        },
        {
            "type": "shell",
            "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} {{ .Path }}",
            "pause_before": "900s",
            "scripts": [
                "/some/path/here/to/script.sh"
            ]
        }

Operating system and Environment details

The instance is mac2.metal; the host running Packer is Ubuntu Focal.

Log Fragments and crash.log files

==> amazon-ebs: Provisioning with shell script: ./provision/core/reboot.sh
    amazon-ebs: Shutdown NOW!
    amazon-ebs:
    amazon-ebs: System shutdown time has arrived
    amazon-ebs: Connection to destination port failed, check SSM Agent logs.
    amazon-ebs: Connection to destination port failed, check SSM Agent logs.
    amazon-ebs: Connection to destination port failed, check SSM Agent logs.
    amazon-ebs: Connection to destination port failed, check SSM Agent logs.
==> amazon-ebs: Provisioning step had errors: Running the cleanup provisioner, if present...
==> amazon-ebs: Terminating the source AWS instance...

henti commented 2 years ago

As per https://github.com/aws/session-manager-plugin/issues/55:

We do not support client side reconnection after instance reboot.
Maybe after instance reboot, you can check it with the get-connection-status API to check is it connected back to session manager service side, then start a new session to connect it.
https://docs.aws.amazon.com/cli/latest/reference/ssm/get-connection-status.html

It would make sense that, when expect_disconnect is coupled with "ssh_interface": "session_manager", Packer checks that the instance is connected back to the Session Manager service before reconnecting.
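
For illustration, here is a minimal Python sketch of the check suggested in that quote: poll the connection status until the instance reports "connected", and only then start a new session. It is not how Packer or the plugin works. The helper names (wait_until_connected, boto3_status) are hypothetical, the 900-second default simply mirrors the pause_before value from the template above, and the boto3 wiring assumes boto3 is installed and AWS credentials are configured.

```python
import time
from typing import Callable


def wait_until_connected(
    get_status: Callable[[str], str],
    instance_id: str,
    timeout_s: float = 900.0,   # assumption: mirrors pause_before in the template
    interval_s: float = 15.0,   # assumption: arbitrary polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status(instance_id) until it returns "connected".

    Returns True once the instance is connected, or False if
    timeout_s elapses first. sleep and clock are injectable so the
    loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status(instance_id) == "connected":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def boto3_status(instance_id: str) -> str:
    """Hypothetical wiring to the SSM GetConnectionStatus API via boto3.

    The API reports "connected" or "notconnected" for the target.
    """
    import boto3  # imported lazily so the polling helper has no hard dependency

    ssm = boto3.client("ssm")
    return ssm.get_connection_status(Target=instance_id)["Status"]
```

A caller would run wait_until_connected(boto3_status, instance_id) after the reboot step and open a new Session Manager session only if it returns True. Taking the status lookup as a parameter keeps the polling loop independent of any AWS client.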

Glyphack commented 1 year ago

I'll give this a try. Recently worked on something else involving SSM.

lbajolet-hashicorp commented 1 year ago

Hi @henti,

Thanks for bringing this up.

Out of curiosity, have you experienced it with other instance types? Or is it something you've only seen with mac2.metal instances? Also, is it something you can reliably reproduce, or is it something that seldom happens?

For reference, SSM sessions should resume on their own if the current one gets interrupted, and this is the behaviour I've seen while testing #311, so I don't know where in the process this fails to re-establish the SSM connection. I'll probably need your help figuring out how to reproduce this error so we can come up with a fix. I saw the excerpt from a template you shared in this thread; could you provide a more complete template that we can run? This would be very helpful.

Thanks in advance!

henti commented 1 year ago

Hi @lbajolet-hashicorp

I've only tested this on mac2.metal types as those are the only types I'm working with on my project.

This might be related to reboot times on AWS, which used to be long (more than 5 minutes) but seem to have been fixed (now less than a minute).

Unfortunately, I moved my testing setup from SSM to a public IP to progress my work, but I'll see if I can get my old environment up and running again to test with.

lbajolet-hashicorp commented 1 year ago

Hi @henti,

No problem. If you are able to reproduce this problem, please let us know, but please don't feel obligated to rush on this. If this is fixed, we should be good in any case.

I'll leave this issue open, and if we can't reproduce the problem anymore, we can close this later.

Thanks for the prompt response!

henti commented 1 year ago

I moved back to SSM to test; reboot times are now much quicker, and this problem no longer happens in my tests.

lbajolet-hashicorp commented 1 year ago

Hi @henti,

Great to hear! I'll close this issue then, since you don't experience the problem anymore. Feel free to reopen if you do encounter the problem again.