opentelekomcloud / terraform-provider-opentelekomcloud

Terraform OpenTelekomCloud provider
https://registry.terraform.io/providers/opentelekomcloud/opentelekomcloud/latest
Mozilla Public License 2.0

AS config image ID change bug #2306

Closed valkaitibor closed 11 months ago

valkaitibor commented 1 year ago

Terraform provider version

Terraform v1.5.3 on linux_amd64

Affected Resource(s)

opentelekomcloud_as_configuration_v1
opentelekomcloud_as_group_v1

Terraform Configuration Files

resource "opentelekomcloud_as_configuration_v1" "autoscaling_config" { scaling_configuration_name = "autoscaling_config" instance_config { flavor = var.autoscalingFlavor image = var.imageID disk { size = 20 volume_type = "SATA" disk_type = "SYS" } key_name = var.keyName } }

resource "opentelekomcloud_as_group_v1" "autoscaling_group" { scaling_group_name = "autoscaling_group" scaling_configuration_id = opentelekomcloud_as_configuration_v1.autoscaling_config.id desire_instance_number = var.desireInstanceNumber min_instance_number = var.minInstanceNumber max_instance_number = var.maxInstanceNumber networks { id = var.subnet_id } security_groups { id = var.secgroup_id } vpc_id = var.vpc_id delete_publicip = true delete_instances = "yes" lbaas_listeners { pool_id = opentelekomcloud_lb_pool_v2.pool_https.id protocol_port = "80" } }

Debug Output/Panic Output

https://gist.github.com/valkaitibor/fad75881d75a6709c96b701bec8814f4

Steps to Reproduce

  1. create the AS group and AS config with terraform apply
  2. then change the imageID variable
  3. use terraform apply again

Expected Behavior

The ECS created by the original AS group should be destroyed, the image ID in the AS config should be changed, and then a new ECS with the new image should be created. If necessary, the connection between the AS config and the AS group should be removed for the time during which the AS config is modified.

OR ANOTHER SOLUTION

The AS group and the AS config, together with the created ECS, should be destroyed, and everything should be created again with the new image ID.

Actual Behavior

I am getting an error message saying that the AS config cannot be modified because it is attached to an AS group. However, if, for example, a loadbalancer with a connected listener needs to be changed, Terraform can make the change: it disconnects the listener, destroys the loadbalancer, redeploys it with the new configuration, and attaches the listener to the new loadbalancer. The same should happen in this case too.

Important Factoids

References

canaykin commented 1 year ago

Hi @valkaitibor, this can currently be solved in the Terraform code by setting a lifecycle rule on the as_configuration resource:

  lifecycle {
    create_before_destroy = true
  }
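
For reference, applied to the configuration resource from this issue it would look roughly like this (just a sketch reusing the names from above):

resource "opentelekomcloud_as_configuration_v1" "autoscaling_config" {
  scaling_configuration_name = "autoscaling_config"

  instance_config {
    flavor = var.autoscalingFlavor
    image  = var.imageID

    disk {
      size        = 20
      volume_type = "SATA"
      disk_type   = "SYS"
    }

    key_name = var.keyName
  }

  # Create the replacement configuration first, repoint the AS group to it,
  # and only then delete the old configuration. This avoids the error about
  # the configuration being attached to a group.
  lifecycle {
    create_before_destroy = true
  }
}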

That being said, I agree that it would be cleaner to have this as the default behavior of the provider.

Best, Can.

anton-sidelnikov commented 1 year ago

@valkaitibor The best solution was provided by @canaykin. The main problem here is that the AS configuration API doesn't support updates; to work around that, the provider would need to figure out which groups are attached, create a new configuration, and substitute the old one with the new one, and this would bring too many dependencies into the code.

valkaitibor commented 1 year ago

Hello @anton-sidelnikov and @canaykin,

Thank you so much, your idea helped me change the image ID in the AS config. However, the ECSs created by the AS group were not changed: they still exist with the old image ID. How can I tell the AS group to destroy those ECSs and create new ones with the new image ID when the image ID is changed?

canaykin commented 1 year ago

Hi @valkaitibor ,

This is the expected behavior of OTC AS: a configuration change only affects newly provisioned nodes in the group. That being said, a blue-green deployment can be achieved in the following manner if the AS group is using instance_terminate_policy="OLD_CONFIG_OLD_INSTANCE".

The idea is to first double the number of nodes and then halve it again using scaling policies, so that the AS group ends up with exactly as many nodes as it started with, but with all of them using the new configuration. Since the total number of nodes is at all times greater than or equal to the number at the start, this method, done correctly, can deploy a new set of nodes without downtime or performance impact.
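
The relevant group setting would look roughly like this (a sketch based on the group from the issue; I am assuming the resource exposes the termination policy via an instance_terminate_policy argument, mirroring the AS API, so please check the provider docs):

resource "opentelekomcloud_as_group_v1" "autoscaling_group" {
  scaling_group_name       = "autoscaling_group"
  scaling_configuration_id = opentelekomcloud_as_configuration_v1.autoscaling_config.id

  # Scale-in removes instances built from the old configuration first, so the
  # double-then-halve cycle leaves only new-configuration nodes in the group.
  instance_terminate_policy = "OLD_CONFIG_OLD_INSTANCE"

  # ... remaining arguments as in the original configuration ...
}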

I cannot really say whether this code still works, but I created something similar for a past project, where I was able to automate this process with Terraform using scheduled scaling policies:

resource "time_offset" "deployer_scale_up" {
  count          = var.node_autodeploy ? 1 : 0
  offset_seconds = var.deployer_scale_up_delay
  triggers = {
    scaling_configuration_id = opentelekomcloud_as_configuration_v1.node_cluster_config.id
  }
  lifecycle {
    ignore_changes = [offset_seconds]
  }
}

resource "time_offset" "deployer_scale_down" {
  count          = var.node_autodeploy ? 1 : 0
  offset_seconds = var.deployer_scale_down_delay
  triggers = {
    scaling_configuration_id = opentelekomcloud_as_configuration_v1.node_cluster_config.id
  }
  lifecycle {
    ignore_changes = [offset_seconds]
  }
}

resource "time_sleep" "deployer_lock" {
  count           = var.node_autodeploy ? 1 : 0
  create_duration = var.deployer_lock_duration
  triggers = {
    scaling_configuration_id = opentelekomcloud_as_configuration_v1.node_cluster_config.id
  }
  lifecycle {
    ignore_changes = [create_duration]
  }
}

resource "opentelekomcloud_as_policy_v2" "deployer_scale_up" {
  count                 = var.node_autodeploy ? 1 : 0
  scaling_policy_name   = "${var.prefix}-deployer-scale-up"
  scaling_policy_type   = "SCHEDULED"
  scaling_resource_id   = opentelekomcloud_as_group_v1.node_cluster_asgroup.id
  scaling_resource_type = "SCALING_GROUP"
  cool_down_time        = 1

  scaling_policy_action {
    operation  = "SET"
    percentage = 200
  }
  scheduled_policy {
    launch_time = formatdate("YYYY-MM-DD'T'hh:mmZ", time_offset.deployer_scale_up[0].rfc3339)
  }
}

resource "opentelekomcloud_as_policy_v2" "deployer_scale_down" {
  count                 = var.node_autodeploy ? 1 : 0
  scaling_policy_name   = "${var.prefix}-deployer-scale-down"
  scaling_policy_type   = "SCHEDULED"
  scaling_resource_id   = opentelekomcloud_as_group_v1.node_cluster_asgroup.id
  scaling_resource_type = "SCALING_GROUP"
  cool_down_time        = 1

  scaling_policy_action {
    operation  = "SET"
    percentage = 50
  }
  scheduled_policy {
    launch_time = formatdate("YYYY-MM-DD'T'hh:mmZ", time_offset.deployer_scale_down[0].rfc3339)
  }
}

To explain briefly, the code creates two scheduled policies. The first policy scales the cluster up to 200% (double the number of nodes) after var.deployer_scale_up_delay. The delay here is needed to give the OTC AS a brief moment to bind and start using the new node configuration.

The second policy scales the cluster down to 50% (halving the number of nodes) after var.deployer_scale_down_delay. This delay should accommodate the boot-up time of the nodes (more precisely, the time it takes until the ELB sees a node as healthy) as well as the delay of the first policy. The constraint here is that var.deployer_scale_down_delay > var.deployer_scale_up_delay + node_healthy_time + some_margin_for_reliability.

While these two alone are enough to handle the deployment, there is also a small time lock to prevent Terraform execution until the scale-down policy has completely finished its job. var.deployer_lock_duration simply halts terraform apply and should ideally satisfy var.deployer_lock_duration > var.deployer_scale_down_delay + node_shutdown_time + some_margin_for_reliability. This way, the user of the script is discouraged from running back-to-back deployments before the previous one has finished.

While the solution is not super clean, it gets the job done and allows automated and reliable blue-green deployments of, for example, new images without downtime. It is also important to keep in mind that the time resolution of the AS policies is in minutes, so the margins on the policies should be at least a minute long to accommodate the worst-case scenario. In my project the values were set to:

deployer_scale_up_delay   = 90    // 90 sec before any policy execution: 30 seconds delay, 60 sec margin
deployer_scale_down_delay = 490   // 490 sec before scale down: 90 sec due to initial delay, 300 sec node boot time, 100 sec margin
deployer_lock_duration    = "10m" // 600 sec lock on terraform execution: 490 sec due to scale_down, 60 seconds for shutdown, 50 sec margin
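
For completeness, the knobs used above could be declared roughly like this (a sketch; the names follow the code above, the defaults are just my example values):

variable "node_autodeploy" {
  description = "Enable the scheduled scale-up/scale-down deployment helpers."
  type        = bool
  default     = true
}

variable "deployer_scale_up_delay" {
  description = "Seconds before the scale-up policy fires after a config change."
  type        = number
  default     = 90
}

variable "deployer_scale_down_delay" {
  description = "Seconds before the scale-down policy fires after a config change."
  type        = number
  default     = 490
}

variable "deployer_lock_duration" {
  description = "Duration string for the time_sleep lock on terraform apply."
  type        = string
  default     = "10m"
}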

Hope it still works and helps you. Best, Can

canaykin commented 1 year ago

Oh and I forgot,

If you do not care about downtime and just want to replace all nodes as quickly as possible, a nicer way to achieve this is to use the replace_triggered_by lifecycle rule in the as_group:

  lifecycle {
    replace_triggered_by = [
      opentelekomcloud_as_configuration_v1.autoscaling_config,
    ]
  }

This way, a change in the as_config will first recreate it, and then destroy the existing AS group along with its nodes and create a new one using the new as_config. The recreation can also be scoped more narrowly, e.g. to image updates only, with:

  lifecycle {
    replace_triggered_by = [
      opentelekomcloud_as_configuration_v1.autoscaling_config.instance_config[0].image,
    ]
  }
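
Placed in context, the rule goes inside the group resource rather than the configuration (again just a sketch reusing the names from the issue):

resource "opentelekomcloud_as_group_v1" "autoscaling_group" {
  # ... arguments as in the original configuration ...

  # Recreate the whole group, and therefore its ECS instances, whenever the
  # image in the AS configuration changes.
  lifecycle {
    replace_triggered_by = [
      opentelekomcloud_as_configuration_v1.autoscaling_config.instance_config[0].image,
    ]
  }
}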

This solution is cleaner in my opinion but lacks the capabilities of the above one. Best, Can

valkaitibor commented 11 months ago

Thank you for the help, I have closed this issue :)