hashicorp / packer-plugin-azure

Packer plugin for Azure Virtual Machine Image Builders
https://www.packer.io/docs/builders/azure
Mozilla Public License 2.0

ARM: Improve validation of `direct_shared_gallery_image_id` and `community_gallery_image_id` and include Microsoft links in plugin docs #424

Open hc-github-team-packer opened 3 months ago

hc-github-team-packer commented 3 months ago

This issue was originally opened by @sabuncumurat in https://github.com/hashicorp/packer/issues/13036 and has been migrated to this repository. The original issue description is below.


I have the following pkr.hcl file:

source "azure-arm" "from-sig" {

  build_resource_group_name = "packer-build"

  shared_image_gallery {
    subscription        = "<subscription-id>"
    resource_group      = "MyImages"
    gallery_name        = "MyGallery"
    image_name          = "pkrgenimg"
    image_version       = "2.2.3"
  }

  shared_image_gallery_destination {
    subscription        = "<subscription-id>"
    resource_group      = "MyImages"
    gallery_name        = "MyGallery"
    image_name          = "pkrgenimg"
    image_version       = "6.4.0"
    target_region {
      name = "northeurope"
    }

    ...
  }

The above works and version 6.4.0 of the image in question is generated successfully from version 2.2.3, as intended. All good.

When I replace image_name and image_version in the shared_image_gallery block with either community_gallery_image_id or direct_shared_gallery_image_id, the image generation fails:

  shared_image_gallery {
    subscription        = "<subscription-id>"
    resource_group      = "MyImages"
    direct_shared_gallery_image_id = "/subscriptions/<subscription-id>/resourceGroups/MyImages/providers/Microsoft.Compute/galleries/MyGallery/images/pkrgenimg/versions/2.2.3"
  }

PowerShell session error:

==> azure-arm.from-sig: ERROR: -> DeploymentFailed : At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
==> azure-arm.from-sig: ERROR: -> BadRequest : Id /subscriptions/<subscription-id>/resourceGroups/MyImages/providers/Microsoft.Compute/galleries//images/ is not a valid resource reference.

Please note that in the second line of the error above, the Id string after BadRequest is missing components after the galleries part.

If this is not a bug, what is the proper way to use an image_id in a shared_image_gallery block?

And a related question: when the above image build fails, three artifacts are left behind and not cleaned up.

I have to remove them manually.

I am under the impression that Packer is supposed to clean up even in case of an error. Is this really the case?

Thank you.

JenGoldstrich commented 3 months ago

Hey @sabuncumurat,

So the issue you're running into here is that Direct Shared Image Galleries and Community Galleries are distinct types of shared images, which cannot be referenced using the Azure resource ID. A Direct Shared Image Gallery is a preview feature (https://learn.microsoft.com/en-us/azure/virtual-machines/share-gallery-direct?tabs=portaldirect) that shares images across tenants and subscriptions; as such, its ID is not a standard Azure resource ID. You can find examples of each of these IDs at https://developer.hashicorp.com/packer/integrations/hashicorp/azure/latest/components/builder/arm but for direct shared galleries it would look something like this: /sharedGalleries/{galleryUniqueName}/Images/{img}[/Versions/{}

A community gallery also has a different resource ID and is not referenced by the subscription ID (https://learn.microsoft.com/en-us/azure/virtual-machines/share-gallery-community?tabs=cli); I believe it is shared with the wider Azure community.

However, there is also some inconsistent logic in the code base around this feature, and some poorly defined docs, so here are a few ways I plan to address this:

1.) We shouldn't allow setting a subscription ID or resource group name in the shared_image_gallery block when using a direct SIG ID or a community ID, as those resources are shared across subscriptions and resource groups. Allowing it caused a code issue here: https://github.com/hashicorp/packer-plugin-azure/blob/1b209f585b32e3a4b9fbe5b990e35d9892dd3e92/builder/azure/arm/template_factory.go#L175-L189 where we check whether the subscription is set first and, if it is, do the "normal" SIG image source.

2.) We should print the source name of the direct or community ID gallery, rather than printing an invalid source name in the step_get_source_image_name

3.) We should update the docs to point to the Microsoft docs that describe these features, and make it clearer that this is a different type of image, so that users can more clearly understand that these are different images than normal compute gallery image versions.
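The check proposed in item 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the sourceGallery struct and validateSourceGallery function are hypothetical names, not the plugin's actual config type or validation code, and the error wording is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// sourceGallery is a hypothetical stand-in for the fields of the
// shared_image_gallery block; it is not the plugin's real config type.
type sourceGallery struct {
	Subscription               string
	ResourceGroup              string
	DirectSharedGalleryImageID string
	CommunityGalleryImageID    string
}

// validateSourceGallery rejects configurations that mix a direct shared or
// community gallery ID with subscription-scoped settings.
func validateSourceGallery(g sourceGallery) error {
	id, field := g.DirectSharedGalleryImageID, "direct_shared_gallery_image_id"
	if id == "" {
		id, field = g.CommunityGalleryImageID, "community_gallery_image_id"
	}
	if id == "" {
		return nil // a regular gallery source; nothing to check here
	}
	if g.Subscription != "" || g.ResourceGroup != "" {
		return fmt.Errorf("subscription and resource_group must not be set together with %s", field)
	}
	// These IDs are not subscription-scoped, so a standard resource ID is wrong.
	if strings.HasPrefix(strings.ToLower(id), "/subscriptions/") {
		return fmt.Errorf("%s must not be a standard Azure resource ID", field)
	}
	return nil
}

func main() {
	// The configuration from the original report.
	bad := sourceGallery{
		Subscription:               "<subscription-id>",
		ResourceGroup:              "MyImages",
		DirectSharedGalleryImageID: "/subscriptions/<subscription-id>/resourceGroups/MyImages/providers/Microsoft.Compute/galleries/MyGallery/images/pkrgenimg/versions/2.2.3",
	}
	fmt.Println(validateSourceGallery(bad))
}
```

With a check like this, the configuration from the report would fail at validation time instead of producing the BadRequest deployment error.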

With regard to your orphaned resources, this is a bit tricky to diagnose without your full build logs. Sometimes when VM deployments fail, Azure won't let the network resources be deleted until the VM's NIC reservation fully times out, which can lead to these leftover resources. I can comment more on this if you

sabuncumurat commented 3 months ago

@JenGoldstrich First off, thank you for your response.

I reran the build to reproduce the orphaned-resources problem. The same set of resources was left behind (pip, nic and vnet). I am attaching the log file, packerlog.txt.

I will comment separately about the other points in your response.

JenGoldstrich commented 3 months ago

The cause of the orphaned resources is this error

==> azure-arm.from-sig: Error: retry count exhausted. Last err: performing Delete: unexpected status 400 (400 Bad Request) with error: NicReservedForAnotherVm: Nic(s) in request is reserved for another Virtual Machine for 180 seconds. Please provide another nic(s) or retry after 180 seconds. Reserved VM: /subscriptions/7ec6293f-2b0e-481d-9aae-0943d6b8f698/resourceGroups/packer-build/providers/Microsoft.Compute/virtualMachines/pkrvm59gjy6cu9r
2024/06/18 11:19:27 ui error: ==> azure-arm.from-sig: retry count exhausted. Last err: perform

Since the VM was never created, we can't delete it, but Azure will refuse to delete the NIC until 180 seconds have passed, which is far longer than our logic keeps retrying to delete resources. If you don't set a build_resource_group_name, the resource group itself is deleted, which I believe bypasses this error, but I understand many organizations require using an existing resource group for builds. In general I wouldn't want to extend the retry window to 3 minutes, as this would make everyone's failed builds longer, but it may be possible to catch this specific error, although I am not sure it's as important to fix as the other issues mentioned in my previous message.
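Catching this specific error might look something like the sketch below. It is an assumption-laden illustration, not the plugin's cleanup code: deleteWithNicWait, noSleep and errReserved are hypothetical names, and matching on the error text is just one possible way to detect the NicReservedForAnotherVm code seen in the log.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// nicReservedCode is the error code that appears in the log above.
const nicReservedCode = "NicReservedForAnotherVm"

// errReserved imitates the failure from the log, for the demo only.
var errReserved = errors.New("unexpected status 400 (400 Bad Request) with error: NicReservedForAnotherVm: Nic(s) in request is reserved for another Virtual Machine for 180 seconds")

// noSleep is a stand-in for time.Sleep that returns immediately.
func noSleep(time.Duration) {}

// deleteWithNicWait is a hypothetical helper. It calls del once; only if the
// failure mentions the NIC reservation does it wait and try one more time.
// Any other outcome is returned immediately, so ordinary failed builds are
// not slowed down.
func deleteWithNicWait(del func() error, wait time.Duration, sleep func(time.Duration)) error {
	err := del()
	if err == nil || !strings.Contains(err.Error(), nicReservedCode) {
		return err
	}
	sleep(wait)
	return del()
}

func main() {
	calls := 0
	del := func() error {
		calls++
		if calls == 1 {
			return errReserved
		}
		return nil
	}
	// noSleep keeps the demo instant; real code would pass time.Sleep.
	err := deleteWithNicWait(del, 180*time.Second, noSleep)
	fmt.Println(calls, err) // prints "2 <nil>"
}
```

The point of the design is that the 180-second wait only applies when Azure reports the reservation, which addresses the concern about lengthening every failed build.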

sabuncumurat commented 3 months ago

Thanks for the explanation. Based on your response I tried this: Modified my template and changed build_resource_group_name to temp_resource_group_name.

Unfortunately, I got this error:

Specify either a location to create the resource group in or an existing build_resource_group_name, but not both.

I don't have a location specified in my template, so the above error makes no sense.

I then removed temp_resource_group_name and specified location.

Happy to report that worked, and a temporary RG was created and following the errors (from earlier) everything (pip, nic, and vnet) was deleted. (The delete took a really long time, and it was totally silent - there were no multiple 'trying...' attempt outputs.)

Thank you sooo much for your guidance. Huge learning experience for me.

One additional curiosity if I may: I searched both the packer main repo and this repo to look for the error string 'Specify either a location to create the resource group in or an existing build_resource_group_name, but not both' but came up empty-handed. Why is this? Thanks again.

JenGoldstrich commented 3 months ago

:] Of course, happy to help! So this error occurs whenever the following condition is true

!xor(location != "", build_resource_group_name != "")

https://github.com/hashicorp/packer-plugin-azure/blob/1b209f585b32e3a4b9fbe5b990e35d9892dd3e92/builder/azure/arm/config.go#L1298-L1300 We throw this error based on an exclusive or (xor) of whether build_resource_group_name is set and whether a build location is set. An improvement could be to throw one error when both are set, which is invalid because the location is used to choose where to create the build resource group, and a different error when neither is set. Even just removing "but not both" would probably make this more accurate.
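The suggested improvement could be sketched as below. This is a minimal illustration, not the code at the linked lines: checkBuildLocation is a hypothetical function, and both error messages are invented wording.

```go
package main

import (
	"errors"
	"fmt"
)

// xor reports whether exactly one of a and b is true.
func xor(a, b bool) bool {
	return a != b
}

// checkBuildLocation is a hypothetical rewrite of the check that returns a
// separate message for each failure mode instead of one shared message.
func checkBuildLocation(location, buildResourceGroupName string) error {
	if xor(location != "", buildResourceGroupName != "") {
		return nil // exactly one of the two is set: valid
	}
	if location != "" {
		// xor was false and location is set, so both are set.
		return errors.New("specify either location or build_resource_group_name, not both")
	}
	// xor was false and location is empty, so neither is set.
	return errors.New("specify a location to create the resource group in, or an existing build_resource_group_name")
}

func main() {
	fmt.Println(checkBuildLocation("", ""))
	fmt.Println(checkBuildLocation("northeurope", "packer-build"))
	fmt.Println(checkBuildLocation("northeurope", ""))
}
```

The first call corresponds to the template that set only temp_resource_group_name: neither value was set, so a message that does not say "but not both" would have been less confusing.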