hashicorp / packer-plugin-googlecompute

Packer plugin for Google Compute Builder
https://www.packer.io/docs/builders/googlecompute
Mozilla Public License 2.0

IAP tunnel process never stops if cleanup is aborted #95

Closed natedogith1 closed 2 years ago

natedogith1 commented 2 years ago

Overview of the Issue

When -on-error=abort is used with packer build (so that the VM can be inspected if something fails) and a provisioning error occurs, the IAP tunnel process never stops.

Reproduction Steps

  1. Create a build file with source googlecompute and a provisioner config that will fail
  2. Run "packer build -on-error=abort /path/to/build/file"
  3. Wait for the build to fail and abort
  4. Run "ps -ef | grep gcloud-setup"

Plugin and Packer version

Packer v1.7.5

Operating system and Environment details

Running from GitLab on a CentOS 8 GitLab runner.

Log Fragments and crash.log files

2022-05-26T17:41:49Z: ==> googlecompute.image-source: Error executing Ansible: Non-zero exit status: exit status 2
2022-05-26T17:41:49Z: ==> googlecompute.image-source: Step "StepProvision" failed, aborting...
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepConnect"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepStartTunnel"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepInstanceInfo"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepCreateWindowsPassword"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepCreateInstance"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepImportOSLoginSSHKey"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "nullStep"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepSSHKeyGen"
2022-05-26T17:41:49Z: ==> googlecompute.image-source: aborted: skipping cleanup of step "StepCheckExistingImage"
2022-05-26T17:41:49Z: Build 'googlecompute.image-source' errored after 17 minutes 25 seconds: Error executing Ansible: Non-zero exit status: exit status 2
==> Wait completed after 17 minutes 25 seconds
==> Some builds didn't complete successfully and had errors:
2022/05/26 17:41:49 [INFO] (telemetry) ending googlecompute.image-source
==> Wait completed after 17 minutes 25 seconds
2022/05/26 17:41:49 machine readable: error-count []string{"1"}
==> Some builds didn't complete successfully and had errors:
2022/05/26 17:41:49 machine readable: googlecompute.image-source,error []string{"Error executing Ansible: Non-zero exit status: exit status 2"}
==> Builds finished but no artifacts were created.
--> googlecompute.image-source: Error executing Ansible: Non-zero exit status: exit status 2
==> Builds finished but no artifacts were created.
2022/05/26 17:41:49 [INFO] (telemetry) Finalizing.
2022/05/26 17:41:50 waiting for all plugin processes to complete...
2022/05/26 17:41:50 .packer.d/.packer.d/plugins/github.com/hashicorp/googlecompute/packer-plugin-googlecompute_v1.0.13_x5.0_linux_amd64: plugin process exited
2022/05/26 17:41:50 /usr/bin/packer: plugin process exited

Followed by a wait until the pipeline is timed out or someone logs in and manually kills the tunnel process.

nywilken commented 2 years ago

Hi @natedogith1, thanks for reaching out. The behavior you are describing is exactly how abort is meant to work. In a normal build flow (either failure or success) there is a cleanup step that is responsible for tearing down any subprocesses started during a build, including the termination of an IAP tunnel.

However, since Packer is being told to abort on error, it exits immediately without running any cleanup steps. That is why the IAP tunnel and any other resources created during the build are not terminated.

Is there a reason why you chose --on-error=abort over --on-error=ask?

If you wish to investigate a failed provisioner, you could instead pass --on-error=ask to the build command. Upon failure, Packer will prompt you for instructions on how to continue. During this time you could connect to the running instance to troubleshoot. When ready, you can retry, abort, or just clean up to exit properly.

natedogith1 commented 2 years ago

This is being run in a CI/CD pipeline, and --on-error=ask requires interaction.

nywilken commented 2 years ago

Hi @natedogith1, thanks for that extra info. At this time I have no workaround other than using an external script to find the PID associated with the tunnel and terminate it. Hard-aborting Packer skips all cleanup processes.
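As an illustration of that external-script approach, here is a minimal sketch of a wrapper that a CI job could call instead of invoking Packer directly. It is not part of the plugin. It assumes the leftover tunnel process has "gcloud-setup" in its command line (taken from the `ps -ef | grep gcloud-setup` step in the report) and that `ps -eo pid=,args=` is available on the runner. The function names `find_tunnel_pids` and `kill_tunnels` are hypothetical.

```python
#!/usr/bin/env python3
"""Run `packer build -on-error=abort`, then terminate leftover tunnel processes."""
import os
import signal
import subprocess
import sys

# Assumption: the leftover tunnel process has "gcloud-setup" in its command
# line, as suggested by the `ps -ef | grep gcloud-setup` step in the report.
TUNNEL_PATTERN = "gcloud-setup"


def find_tunnel_pids(ps_output, pattern=TUNNEL_PATTERN, exclude=()):
    """Return PIDs from `ps -eo pid=,args=` output whose command line contains pattern."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip lines that do not look like "<pid> <command line>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if pattern in args and pid not in exclude:
            pids.append(pid)
    return pids


def kill_tunnels():
    """Send SIGTERM to every matching process and return the PIDs signalled."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid=,args="], capture_output=True, text=True, check=True
    ).stdout
    killed = []
    for pid in find_tunnel_pids(ps_output, exclude={os.getpid()}):
        try:
            os.kill(pid, signal.SIGTERM)
            killed.append(pid)
        except (ProcessLookupError, PermissionError):
            # The process already exited or belongs to another user.
            pass
    return killed


def main(argv):
    # argv holds the arguments passed through to `packer build`.
    result = subprocess.run(["packer", "build", "-on-error=abort", *argv])
    for pid in kill_tunnels():
        print(f"terminated leftover tunnel process {pid}", file=sys.stderr)
    # Preserve Packer's exit status so the pipeline still reports the failure.
    return result.returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

The wrapper leaves the VM in place for inspection and only signals the tunnel. Because it matches on the command line, it would also terminate tunnels belonging to other builds running on the same runner, so a shared runner would need a narrower match. Depending on how the tunnel is launched, its child processes may need to be matched separately.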

At this time IAP does not seem to offer a no-activity timeout that would automatically terminate the tunnel. According to the Google documentation, the tunnel will terminate itself after 1 hour.

I'm going to close this issue as there is no solution for cleaning up once a build is aborted. But I'll apply the track-internal label so that we can revisit the Google SDK/CLI in the future if things should change.

natedogith1 commented 1 year ago

I have multiple IAP tunnel processes that have been running for days without a connection and with the VM deleted.