juanjosevazquezgil commented 1 month ago

Motivation

We recently found a bug in one of our clients' environments. They have the following setup:

- A claims repo, using firestartr
- A Virtual Machine Scale Set (VMSS) in Azure, which uses a custom extension script to set up and start its VMs. Both the VMSS and the script are managed by Terraform

The error happened when following these steps:

1. Create a PR where the custom extension script of a VMSS is updated, so it looks for and downloads a non-existing image
2. Commit that change, apply it with Terraform and upload it to Azure. The VMSS should fail to start
3. Create another PR where the previous error is fixed
4. Commit that change, apply it with Terraform and upload it to Azure. The VMSS should still fail to start, with the same error as step 2
5. To fix this, you must manually upgrade each VM in the VMSS

It seems that Terraform/Azure only upgrades the VMs when their custom extension script has no errors (at least when that script is used to start the VM). We need to confirm that this is the case, investigate why it happens, and find out how to fix it.

Acceptance criteria

[x] The cause of the problem is identified and can be reproduced
[x] We know why the problem happens
[x] A fix is discussed and applied, if possible

juanjosevazquezgil commented 1 month ago

Confirmed: the steps described in this issue reproduce the error.

alambike commented 2 weeks ago

Solved by changing the upgrade mode from Manual (the default) to Automatic:

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/linux_virtual_machine_scale_set#upgrade_mode
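
For reference, a minimal sketch of what that change might look like in the VMSS resource. This is not the client's actual configuration: the resource name, the variables and the script command are hypothetical, and the other required arguments are left out. Depending on the provider version, Automatic mode may also require extra settings such as a health probe or a rolling upgrade policy, so check the linked documentation before applying it.

```hcl
# Sketch only: names and variables are hypothetical, and the remaining
# required arguments (sku, instances, os_disk, network_interface, ...) are omitted.
resource "azurerm_linux_virtual_machine_scale_set" "example" {
  name                = "example-vmss"
  resource_group_name = var.resource_group_name
  location            = var.location

  # Defaults to "Manual", where changes to the scale set model are not
  # rolled out to existing instances until each one is upgraded by hand.
  upgrade_mode = "Automatic"

  # Custom extension script used to set up and start the VMs.
  extension {
    name                 = "startup-script"
    publisher            = "Microsoft.Azure.Extensions"
    type                 = "CustomScript"
    type_handler_version = "2.0"

    settings = jsonencode({
      fileUris         = [var.startup_script_url]
      commandToExecute = "bash startup.sh"
    })
  }

  # ... remaining required arguments omitted
}
```

With Manual mode, fixing the script in Terraform only updates the scale set model, which would explain why the existing instances kept failing with the old script until they were upgraded manually.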

prefapp / tfm

[BUG]: Terraform sometimes doesn't automatically upgrade a VMSS instance when updating a custom extension script #119
