hashicorp / packer-plugin-googlecompute

Packer plugin for Google Compute Builder
https://www.packer.io/docs/builders/googlecompute
Mozilla Public License 2.0

Default behavior of build variable `-on-error=cleanup` is not being respected #111

Open schmidtd opened 1 year ago

schmidtd commented 1 year ago

Overview of the Issue

The documented default for the -on-error option is cleanup. That means that when a Packer build fails, the VM and disk that were created should be deleted. However, if -on-error is not specified, a failed build leaves the VM running and the disk allocated. This leaks resources in the default case and conflicts with the documentation:

website/content/docs/commands/build.mdx:- `-on-error=cleanup` (default), `-on-error=abort`, `-on-error=ask`, `-on-error=run-cleanup-provisioner` -

It seems the default behavior is actually abort. I am running the plugin as part of a GitHub action pipeline.

Reproduction Steps

Create a build that errors out. Any error that causes the build to fail without producing an artifact will do (that is, you see the message ==> Builds finished but no artifacts were created.)

Plugin and Packer version

Installed plugin github.com/hashicorp/googlecompute v1.0.15 in "/github/home/.config/packer/plugins/github.com/hashicorp/googlecompute/packer-plugin-googlecompute_v1.0.15_x5.0_linux_amd64"

Operating system and Environment details

My GitHub Actions workflow specifies Ubuntu:

Current runner version: '2.295.0'
Operating System
  Ubuntu
  20.04.4
  LTS
Runner Image
  Image: ubuntu-20.04
  Version: 20220821.1
  Included Software: https://github.com/actions/runner-images/blob/ubuntu20/20220821.1/images/linux/Ubuntu2004-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu20%2F20220821.1

Log Fragments and crash.log files

Here is the log if you don't specify -on-error=cleanup:

==> aoc-gcp.googlecompute.aoc-gcp: Error: initializing source docker://nogood.redhat.io/rhel8/redis-6@sha256:d7c7852338717308cbb59e9303e1ea35cc8e5c01ceb2818569be20c15f3f943d: pinging container registry nogood.redhat.io: Get "https://nogood.redhat.io/v2/": dial tcp: lookup nogood.redhat.io on 169.254.169.254:53: no such host
==> aoc-gcp.googlecompute.aoc-gcp: Script exited with non-zero exit status: 125. Allowed exit codes are: [0]
==> aoc-gcp.googlecompute.aoc-gcp: Step "StepProvision" failed, aborting...
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepConnect"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepStartTunnel"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepInstanceInfo"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepCreateWindowsPassword"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepCreateInstance"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepImportOSLoginSSHKey"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "nullStep"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepSSHKeyGen"
==> aoc-gcp.googlecompute.aoc-gcp: aborted: skipping cleanup of step "StepCheckExistingImage"
Build 'aoc-gcp.googlecompute.aoc-gcp' errored after 9 minutes 13 seconds: Script exited with non-zero exit status: 125. Allowed exit codes are: [0]
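
The `aborted: skipping cleanup of step` lines in the log above can serve as a signal that resources were left behind. The following is a minimal sketch of a CI check, assuming the build output has been saved to a file. The function name `cleanup_was_skipped` and the file name `build.log` are hypothetical, and the pattern is copied from this log, so other Packer versions may word it differently.

```shell
# Returns 0 if the given Packer log shows that cleanup was skipped.
# The pattern is taken verbatim from the log in this issue.
cleanup_was_skipped() {
  grep -q 'aborted: skipping cleanup of step' "$1"
}

# Example use in a CI step (build.log is a hypothetical file name):
#   packer build template.pkr.hcl 2>&1 | tee build.log
#   if cleanup_was_skipped build.log; then
#     echo "cleanup was skipped; VM and disk may still exist"
#   fi
```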

Here is the log if you explicitly specify -on-error=cleanup:

==> aoc-gcp.googlecompute.aoc-gcp: Error: initializing source docker://nogood.redhat.io/rhel8/redis-6@sha256:d7c7852338717308cbb59e9303e1ea35cc8e5c01ceb2818569be20c15f3f943d: pinging container registry nogood.redhat.io: Get "https://nogood.redhat.io/v2/": dial tcp: lookup nogood.redhat.io on 169.254.169.254:53: no such host
==> aoc-gcp.googlecompute.aoc-gcp: Provisioning step had errors: Running the cleanup provisioner, if present...
==> aoc-gcp.googlecompute.aoc-gcp: Deleting instance...
    aoc-gcp.googlecompute.aoc-gcp: Instance has been deleted!
==> aoc-gcp.googlecompute.aoc-gcp: Deleting disk...
    aoc-gcp.googlecompute.aoc-gcp: Disk has been deleted!
Build 'aoc-gcp.googlecompute.aoc-gcp' errored after 8 minutes 56 seconds: Script exited with non-zero exit status: 125. Allowed exit codes are: [0]

spencer-cdw commented 1 year ago

We are also seeing this behavior on 1.8.6. It appears that the default behavior is abort. Images are not cleaned up on error.

nywilken commented 1 year ago

@spencer-cdw @schmidtd do you have a working configuration that I can use to reproduce this issue?

I am unable to reproduce this with the latest versions of the Google Compute plugin and Packer. By default, the OnError flag is unset, and that unset value is what gets passed to the SDK, which controls the build runner. When OnError is unset, the runner defaults to cleanup. Locally I see the cleanup happening as expected, and setting -on-error=abort aborts the steps as specified.

Sample template with error in the provisioner: https://gist.github.com/nywilken/3f2276c5b202fa42ff14ab8723a5998a

I installed the latest version of the plugin using the command below.

~>  packer plugins install github.com/hashicorp/googlecompute
Installed plugin github.com/hashicorp/googlecompute v1.1.1 in "/Users/dev/.packer.d/plugins/github.com/hashicorp/googlecompute/packer-plugin-googlecompute_v1.1.1_x5.0_darwin_arm64"

Is there any other information that you can provide to help figure out what might be happening here? If possible, a full debug log might provide some insight into what is being passed.
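
For anyone gathering that debug log, here is a minimal sketch of one way to capture it. `PACKER_LOG` and `PACKER_LOG_PATH` are Packer's documented debug-logging environment variables; the wrapper name `packer_build_debug` and the default file name `packer-debug.log` are assumptions made for this example.

```shell
# Runs `packer build` with verbose logging written to a file.
# The default log file name is an arbitrary choice for this sketch;
# set PACKER_LOG_PATH beforehand to override it.
packer_build_debug() {
  PACKER_LOG=1 \
  PACKER_LOG_PATH="${PACKER_LOG_PATH:-packer-debug.log}" \
  packer build "$@"
}

# Example (template.pkr.hcl is a placeholder):
#   packer_build_debug template.pkr.hcl
```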

schmidtd commented 1 year ago

I did have one when the issue was written, but we now always set -on-error=cleanup explicitly in all cases. If cleanup now happens when the flag is unset, it looks like the fix was in the SDK.
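
The workaround of always passing the flag can be sketched as a small wrapper for CI scripts. This is an illustration rather than the configuration used in this thread: the function name `packer_build_safe` and the `ON_ERROR_MODE` variable are hypothetical, and only the `-on-error` flag itself comes from Packer.

```shell
# Always passes -on-error explicitly so the build does not rely on the
# default. ON_ERROR_MODE is a hypothetical variable for this sketch and
# falls back to cleanup when unset.
packer_build_safe() {
  packer build -on-error="${ON_ERROR_MODE:-cleanup}" "$@"
}

# Example (template.pkr.hcl is a placeholder):
#   packer_build_safe template.pkr.hcl
#   ON_ERROR_MODE=abort packer_build_safe template.pkr.hcl
```

Spelling the flag out costs nothing on Packer versions where the default already works, and it avoids leaked VMs and disks on versions where it does not.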