terraform-aws-modules / terraform-aws-batch

Terraform module to create AWS Batch resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/batch/aws
Apache License 2.0

Removal of attempt_duration_seconds makes the terraform plan incoherent #21

Closed: rcosperec closed this issue 1 year ago

rcosperec commented 1 year ago

Description

If a job definition is created with attempt_duration_seconds set, and attempt_duration_seconds is later removed, every subsequent terraform plan reports that the job definition must be replaced because of an empty timeout block. The diff never goes away.
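A minimal sketch of the triggering change, assuming the example entry of job_definitions from the reproduction code below. The two versions of job_definitions are alternative states of the same argument inside the module "batch" block, not one valid file, and the arguments not shown are assumed unchanged:

```hcl
# State 1: applied with a per-attempt timeout.
job_definitions = {
  example = {
    name                     = "example"
    attempt_duration_seconds = 60
    # container_properties, retry_strategy and tags as in the reproduction code
  }
}

# State 2: attempt_duration_seconds removed. From here on, terraform plan
# keeps proposing to replace the job definition because of an empty
# timeout block.
job_definitions = {
  example = {
    name = "example"
    # container_properties, retry_strategy and tags as in the reproduction code
  }
}
```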

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (! ONLY if state is stored remotely, which hopefully you are doing as a best practice!): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Reproduction Code [Required]

module "batch" {
  source = "terraform-aws-modules/batch/aws"

  compute_environments = {
    a_ec2 = {
      name_prefix = "ec2"

      compute_resources = {
        type           = "EC2"
        min_vcpus      = 4
        max_vcpus      = 16
        desired_vcpus  = 4
        instance_types = ["m5.large", "r5.large"]

        security_group_ids = ["sg-f1d03a88"]
        subnets            = ["subnet-30ef7b3c", "subnet-1ecda77b", "subnet-ca09ddbc"]

        # Note - any tag changes here will force compute environment replacement
        # which can lead to job queue conflicts. Only specify tags that will be static
        # for the lifetime of the compute environment
        tags = {
          # This will set the name on the EC2 instances launched by this compute environment
          Name = "example"
          Type = "Ec2"
        }
      }
    }

    b_ec2_spot = {
      name_prefix = "ec2_spot"

      compute_resources = {
        type                = "SPOT"
        allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
        bid_percentage      = 20

        min_vcpus      = 4
        max_vcpus      = 16
        desired_vcpus  = 4
        instance_types = ["m4.large", "m3.large", "r4.large", "r3.large"]

        security_group_ids = ["sg-f1d03a88"]
        subnets            = ["subnet-30ef7b3c", "subnet-1ecda77b", "subnet-ca09ddbc"]

        # Note - any tag changes here will force compute environment replacement
        # which can lead to job queue conflicts. Only specify tags that will be static
        # for the lifetime of the compute environment
        tags = {
          # This will set the name on the EC2 instances launched by this compute environment
          Name = "example-spot"
          Type = "Ec2Spot"
        }
      }
    }
  }

  # Job queues and scheduling policies
  job_queues = {
    low_priority = {
      name     = "LowPriorityEc2"
      state    = "ENABLED"
      priority = 1

      compute_environments = ["b_ec2_spot"]

      tags = {
        JobQueue = "Low priority job queue"
      }
    }

    high_priority = {
      name     = "HighPriorityEc2"
      state    = "ENABLED"
      priority = 99

      fair_share_policy = {
        compute_reservation = 1
        share_decay_seconds = 3600

        share_distribution = [{
          share_identifier = "A1*"
          weight_factor    = 0.1
          }, {
          share_identifier = "A2"
          weight_factor    = 0.2
        }]
      }

      tags = {
        JobQueue = "High priority job queue"
      }
    }
  }

  job_definitions = {
    example = {
      name           = "example"
      propagate_tags = true

      container_properties = jsonencode({
        command = ["ls", "-la"]
        image   = "public.ecr.aws/runecast/busybox:1.33.1"
        resourceRequirements = [
          { type = "VCPU", value = "1" },
          { type = "MEMORY", value = "1024" }
        ]
        logConfiguration = {
          logDriver = "awslogs"
          options = {
            awslogs-group         = "/aws/batch/example"
            awslogs-region        = "us-east-1"
            awslogs-stream-prefix = "ec2"
          }
        }
      })

      attempt_duration_seconds = 60
      retry_strategy = {
        attempts = 3
        evaluate_on_exit = {
          retry_error = {
            action       = "RETRY"
            on_exit_code = 1
          }
          exit_success = {
            action       = "EXIT"
            on_exit_code = 0
          }
        }
      }

      tags = {
        JobDefinition = "Example"
      }
    }
  }

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}

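Steps to reproduce the behavior:

  1. Apply the reproduction code above, which sets attempt_duration_seconds = 60 on the example job definition.
  2. Remove attempt_duration_seconds from the example job definition.
  3. Run terraform plan.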

Expected behavior

No changes need to be made: once the removal has been applied, terraform plan reports no changes.

Actual behavior

Every terraform plan reports that the job definition must be replaced because of an empty timeout block.

Terminal Output Screenshot(s)


Additional context

bryantbiggs commented 1 year ago

This is due to how the job definition is defined in the provider; this module has no control over this behavior: https://github.com/hashicorp/terraform-provider-aws/blob/3273a9b01e75b1364f608838ba7898dabbccf76a/internal/service/batch/job_definition.go#L161

According to the CloudFormation documentation for the job definition timeout property (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-batch-jobdefinition.html#cfn-batch-jobdefinition-timeout), updating it should not require any interruption, so I would open an issue with the Terraform AWS provider.
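For context, here is a minimal sketch of the provider-level resource involved. It assumes the module maps attempt_duration_seconds onto the timeout block of aws_batch_job_definition. The variable name and the dynamic block below are hypothetical illustrations, not the module's actual code. If the provider schema marks timeout as forcing a new resource (presumably what the linked job_definition.go line shows), then any diff on that block, including an empty one, is planned as a replacement. No amount of module-side configuration changes that.

```hcl
# Hypothetical input standing in for the module's attempt_duration_seconds.
variable "attempt_duration_seconds" {
  description = "Per-attempt timeout in seconds, or null for no timeout"
  type        = number
  default     = null
}

resource "aws_batch_job_definition" "example" {
  name = "example"
  type = "container"

  container_properties = jsonencode({
    command = ["ls", "-la"]
    image   = "public.ecr.aws/runecast/busybox:1.33.1"
    resourceRequirements = [
      { type = "VCPU", value = "1" },
      { type = "MEMORY", value = "1024" }
    ]
  })

  # Render the timeout block only when a value is set. Whether the plan
  # still diffs against the timeout stored in state after removal is
  # decided by the provider schema, not by this configuration.
  dynamic "timeout" {
    for_each = var.attempt_duration_seconds == null ? [] : [var.attempt_duration_seconds]
    content {
      attempt_duration_seconds = timeout.value
    }
  }
}
```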

bryantbiggs commented 1 year ago

Closing here for now.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.