hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0
9.86k stars 9.21k forks

[Bug]: Sagemaker endpoint update fails when changing endpoint configuration #40048

Open kylet21 opened 2 weeks ago

kylet21 commented 2 weeks ago

Terraform Core Version

= 1.4.6

AWS Provider Version

~> 5.0

Affected Resource(s)

aws_sagemaker_endpoint_configuration
aws_sagemaker_endpoint

Expected Behavior

The new SageMaker endpoint configuration is associated with the endpoint, the endpoint transitions to Updating status and then back to active, and only then is the old endpoint configuration deleted.

Actual Behavior

The endpoint is not updated successfully because the old endpoint configuration is deleted before the endpoint is updated. The error output when viewing the endpoint in the console is:

ValidationException: Could not find endpoint configuration "<old_endpoint_config_arn>".

The endpoint configuration has a lifecycle rule for create_before_destroy, but that is not helping.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

resource "aws_sagemaker_model" "main_model" {
  name               = "${var.model_name}-${var.prefix}-model-${random_id.force_new_resources.hex}"
  execution_role_arn = var.sagemaker_execution_role

  primary_container {
    image          = var.use_prebuild_image ? local.prebuilt_ecr_repo_uri : local.custom_ecr_repo_uri
    model_data_url = var.model_artifact_location
    environment    = var.environment_variables
  }

  vpc_config {
    security_group_ids = var.security_groups
    subnets            = var.subnet_ids
  }

  tags = local.sagemaker_common_tags

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_sagemaker_endpoint_configuration" "main_model" {
  name = "${var.model_name}-${var.prefix}-endpointconfig-${random_id.force_new_resources.hex}"

  production_variants {
    variant_name           = local.variant_name
    model_name             = aws_sagemaker_model.main_model.name
    initial_instance_count = var.instance_count
    instance_type          = var.instance_type
    enable_ssm_access      = var.enable_endpoint_ssm_access
  }

  tags = local.sagemaker_common_tags

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_sagemaker_endpoint" "main_model_endpoint" {
  name = "${var.model_name}-${var.realm}-inference-endpoint"

  endpoint_config_name = aws_sagemaker_endpoint_configuration.main_model.name

  tags = local.sagemaker_common_tags
}
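
The configuration above references random_id.force_new_resources without showing its definition. A minimal sketch of what it might look like, assuming the ID is regenerated whenever the model artifact location or the prefix changes (the byte_length and the keepers map are assumptions, not taken from the actual configuration):

```hcl
# Hypothetical definition: the issue does not include this resource.
resource "random_id" "force_new_resources" {
  # Assumed length; any positive value works.
  byte_length = 4

  # Assumption: changing either input produces a new hex suffix, which
  # renames (and therefore replaces) the model and endpoint configuration.
  keepers = {
    model_data_url = var.model_artifact_location
    prefix         = var.prefix
  }
}
```

With a definition along these lines, a change to either input replaces both aws_sagemaker_model.main_model and aws_sagemaker_endpoint_configuration.main_model, while the endpoint itself is updated in place because its name does not include the random suffix.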

Steps to Reproduce

1. Update the model by providing a new model_data_url and prefix for the resource(s).
2. This will also cause an update to the endpoint_configuration.
3. Run a terraform apply to update the endpoint with this new endpoint_configuration and model.
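
As an illustration of the change that triggers this, a variables file along the following lines could be used for the second apply. The file name and both values are hypothetical placeholders; only the variable names come from the configuration above.

```hcl
# repro.tfvars (hypothetical file name and values)

# New artifact location; feeds model_data_url on aws_sagemaker_model.main_model.
model_artifact_location = "s3://example-bucket/models/v2/model.tar.gz"

# New prefix; changes the names of the model and the endpoint configuration.
prefix = "v2"
```

Running terraform apply -var-file=repro.tfvars after an initial apply with different values should then plan a replacement of the model and endpoint configuration and an in-place update of the endpoint.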

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

baskervilski commented 1 week ago

Just to add that this is a blatant bug, as the official AWS documentation on UpdateEndpoint clearly states:

Note: You must not delete an EndpointConfig in use by an endpoint that is live or while the UpdateEndpoint or CreateEndpoint operations are being performed on the endpoint. To update an endpoint, you must create a new EndpointConfig.

If you delete the EndpointConfig of an endpoint that is active or being created or updated you may lose visibility into the instance type the endpoint is using. The endpoint must be deleted in order to stop incurring charges.

https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html

Any suggestions for workarounds until this is fixed?