spotinst / terraform-provider-spotinst

Terraform Spotinst provider.
https://registry.terraform.io/providers/spotinst/spotinst/latest/docs
Mozilla Public License 2.0

Timeout while waiting for state to become 'success' introduced with v1.8.0 #37

Closed SimonNightingale closed 5 years ago

SimonNightingale commented 5 years ago

Hi there, we have encountered an issue since moving to v1.8.0 of this provider. When we run the script, everything progresses as usual until roughly 5 minutes in, when we are presented with the error shown in the output below.

This issue has only appeared since v1.8.0 was released. If we roll back the provider to v1.7.0 and run the same script, everything works correctly as it did before the update.
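
For reference, the rollback is just a provider version pin; a minimal sketch (Terraform 0.11 syntax, and the exact constraint value here is only an illustration):

provider "spotinst" {
  # Pin to the last known-good release until the v1.8.0 timeout issue is resolved.
  version = "~> 1.7.0"
}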

On an unrelated note, is the documentation for elastigroups correct? It seems there is a duplicated version of wait_for_roll_percentage and wait_for_roll_timeout (once under update_policy and again under roll_config), or am I just not understanding it correctly?

update_policy = {
    should_resume_stateful = false
    should_roll            = false
    auto_apply_tags        = false
    wait_for_pct_complete  = 10
    wait_for_pct_timeout   = 1500

    roll_config = {
      batch_size_percentage = 33
      health_check_type     = "ELB"
      grace_period          = 300
      wait_for_roll_percentage = 10
      wait_for_roll_timeout    = 1500
    }
}

Any help with this would be appreciated.

Terraform Version

Terraform v0.11.8

Affected Resource(s)

spotinst_elastigroup_aws

Terraform Configuration Files

resource "spotinst_elastigroup_aws" "corporate" {
  name                          = "${var.octopus_project_safe}-${var.octopus_environment} - ${var.version}"
  description                   = "created by Terraform"
  product                       = "Windows"
  subnet_ids                    = ["${var.aws_subnet1}", "${var.aws_subnet2}", "${var.aws_subnet3}"]
  max_size                      = "${var.asg_max_instances}"
  min_size                      = "${var.asg_min_instances}"
  desired_capacity              = "${var.asg_desired_capacity}"
  capacity_unit                 = "instance"
  region                        = "${var.region}"
  image_id                      = "${var.aws_ami}"
  iam_instance_profile          = "webserver"
  key_name                      = "cloudit"
  security_groups               = ["${var.aws_securitygroups}"]
  user_data                     = "${file("StartService.ps1")}"
  enable_monitoring             = false
  ebs_optimized                 = false
  instance_types_ondemand       = "${var.aws_instance_type}"
  instance_types_spot           = ["${var.spotinst_instance_types}"]
  instance_types_preferred_spot = ["${var.aws_instance_type}"]
  orientation                   = "costOriented"
  fallback_to_ondemand          = true
  spot_percentage               = 100

  wait_for_capacity         = "${var.asg_min_instances}"
  wait_for_capacity_timeout = 900

  scheduled_task = [
    {
      task_type             = "scale"
      cron_expression       = "${var.asg_recurrence_night}"
      scale_target_capacity = "${var.asg_desired_capacity_night}"
      scale_min_capacity    = "${var.asg_min_instances_night}"
      scale_max_capacity    = "${var.asg_max_instances_night}"
      is_enabled            = true
    },
    {
      task_type             = "scale"
      cron_expression       = "${var.asg_recurrence_day}"
      scale_target_capacity = "${var.asg_min_instances}"
      scale_min_capacity    = "${var.asg_min_instances}"
      scale_max_capacity    = "${var.asg_max_instances}"
      is_enabled            = true
    },
  ]

  scaling_up_policy = {
    policy_name = "corporate-CPU-scaling-up"
    metric_name = "CPUUtilization"
    statistic   = "average"
    unit        = "percent"
    threshold   = 75
    action_type = "adjustment"
    adjustment  = "1"
    namespace   = "AWS/EC2"

    dimensions = {
      name  = "AutoScalingGroupName"
      value = "${var.octopus_project_safe}-${var.octopus_environment}-Up"
    }

    period             = 300
    evaluation_periods = 2
    cooldown           = 300
    operator           = "gte"
  }

  scaling_down_policy = {
    policy_name = "corporate-CPU-scaling-down"
    metric_name = "CPUUtilization"
    statistic   = "average"
    unit        = "percent"
    threshold   = 35
    action_type = "adjustment"
    adjustment  = "1"
    namespace   = "AWS/EC2"

    dimensions = {
      name  = "AutoScalingGroupName"
      value = "${var.octopus_project_safe}-${var.octopus_environment}-Down"
    }

    period             = 300
    evaluation_periods = 2
    cooldown           = 300
    operator           = "lte"
  }

  health_check_grace_period = "${var.asg_health_check_grace_period}"
  health_check_type         = "TARGET_GROUP"
  target_group_arns         = ["${aws_alb_target_group.corporate.arn}"]

  update_policy = {
    should_resume_stateful = false
    should_roll            = true

    roll_config = {
      batch_size_percentage = 100
      health_check_type     = "TARGET_GROUP"
      grace_period          = 900

      wait_for_roll_percentage = 100
      wait_for_roll_timeout = 900
    }
  }

  tags = [
    {
      key   = "Name"
      value = "${var.octopus_project_safe}-${var.octopus_environment}-${var.version}"
    },
    {
      key   = "OctopusProject"
      value = "${var.octopus_project}"
    },
    {
      key   = "OctopusEnvironment"
      value = "${var.octopus_environment}"
    },
  ]

  lifecycle {
    ignore_changes = [
      "desired_capacity",
    ]
  }
}

Debug Output

2019-03-04T13:18:41.603Z [DEBUG] plugin.terraform-provider-spotinst_v1.8.0_x4.exe: ===> waiting for at least 100% of batches to complete, currently 0% <===
2019-03-04T13:18:41.603Z [DEBUG] plugin.terraform-provider-spotinst_v1.8.0_x4.exe: [TRACE] Waiting 10s before next try
2019-03-04T13:18:41.649Z [DEBUG] plugin.terraform-provider-spotinst_v1.8.0_x4.exe: [ERROR] WaitForState exceeded refresh grace period
2019-03-04T13:18:41.649Z [DEBUG] plugin.terraform-provider-spotinst_v1.8.0_x4.exe: [ERROR] Group [sig-ccc57383] roll failed, error: timeout while waiting for state to become 'success' (timeout: 5m0s)
2019/03/04 13:18:41 [TRACE] root: eval: *terraform.EvalWriteState
2019/03/04 13:18:41 [TRACE] root: eval: *terraform.EvalApplyProvisioners
2019/03/04 13:18:41 [TRACE] root: eval: *terraform.EvalIf
2019/03/04 13:18:41 [TRACE] root: eval: *terraform.EvalWriteState
2019/03/04 13:18:41 [TRACE] root: eval: *terraform.EvalWriteDiff
2019/03/04 13:18:41 [TRACE] root: eval: *terraform.EvalApplyPost
2019/03/04 13:18:41 [ERROR] root: eval: *terraform.EvalApplyPost, err: 1 error(s) occurred:

* spotinst_elastigroup_aws.corporate: timeout while waiting for state to become 'success' (timeout: 5m0s)
2019/03/04 13:18:41 [ERROR] root: eval: *terraform.EvalSequence, err: 1 error(s) occurred:

* spotinst_elastigroup_aws.corporate: timeout while waiting for state to become 'success' (timeout: 5m0s)

Expected Behavior

The apply completes without a timeout error.

Actual Behavior

A timeout error appears after roughly 5 minutes.

Steps to Reproduce

  1. Run terraform apply with provider v1.8.0

alexindeed commented 5 years ago

v1.8.0 introduced a second retry with a maximum timeout of 5 minutes around a nested retry that needs a minimum timeout of 25 minutes. This causes the apply to time out whenever a roll takes longer than 5 minutes. I have corrected the behavior.

The duplicated wait_for_roll_percentage and wait_for_roll_timeout fields will also be fixed. These fields were moved during a revision, and an older version accidentally made it back in during a rebase, which is why the duplication also appears in the docs.
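
For clarity, the intended shape keeps both fields only under roll_config, as in the configuration above; a sketch with illustrative values (not the final docs text):

update_policy = {
    should_resume_stateful = false
    should_roll            = false
    auto_apply_tags        = false

    roll_config = {
      batch_size_percentage    = 33
      health_check_type        = "ELB"
      grace_period             = 300
      wait_for_roll_percentage = 10
      wait_for_roll_timeout    = 1500
    }
}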

SimonNightingale commented 5 years ago

Thank you, I've just tested the v1.9.0 release and it has fixed this issue.

Ala005 commented 5 years ago

Hi there, we are copying an AMI from another AMI and it times out at 40 minutes (yes, the AMI has a large amount of data). Is there a way to fix this?

Error waiting for AMI to be ready: timeout while waiting for state to become 'available' (last state: 'pending', timeout: 40m0s)

The AMI ends up in a tainted state in the Terraform tfstate file.

Thank you!

itzikadz commented 5 years ago

Which provider are you using for the AMI copy operation?

Ala005 commented 5 years ago

We are using the AWS provider.

Thanks

itzikadz commented 5 years ago

@Ala005, please address this issue to the relevant provider: https://github.com/terraform-providers/terraform-provider-aws. We can't control the AWS provider.
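
For what it's worth, if the copy is done with the AWS provider's aws_ami_copy resource, the 40-minute default can usually be raised with a timeouts block; a minimal sketch (resource and variable names here are assumptions for illustration):

resource "aws_ami_copy" "example" {
  name              = "copied-ami"
  source_ami_id     = "${var.source_ami_id}"
  source_ami_region = "${var.source_ami_region}"

  # Large images can take well over the 40m default to reach 'available'.
  timeouts {
    create = "120m"
  }
}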