hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: SageMaker endpoint updates can fail silently #39099

Closed Shi-vasana closed 3 days ago

Shi-vasana commented 2 weeks ago

Terraform Core Version

1.9.0

AWS Provider Version

5.63.1

Affected Resource(s)

aws_sagemaker_endpoint

Expected Behavior

When terraform apply calls UpdateEndpoint for an aws_sagemaker_endpoint resource, it should verify that the requested changes were actually applied.

However, the UpdateEndpoint call itself succeeds as long as the request has no immediate errors (e.g. wrong permissions, invalid resources). If the endpoint is already in service and the update later fails, the endpoint can remain in service on its previous configuration, with a failure reason populated in the DescribeEndpoint output.
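
To illustrate the kind of verification described above, here is a minimal sketch in Python (not the provider's actual Go implementation). It polls a caller-supplied `describe` function until the endpoint leaves a transitional status, then checks that the endpoint is actually running the requested configuration. The names `wait_for_endpoint_update`, `EndpointUpdateError`, and `TRANSITIONAL` are hypothetical, and the set of statuses treated as transitional is an assumption.

```python
import time
from typing import Callable, Dict

# Statuses treated as "still in progress" in this sketch (assumed set).
TRANSITIONAL = {"Creating", "Updating", "SystemUpdating", "RollingBack"}


class EndpointUpdateError(RuntimeError):
    """Raised when an endpoint update did not take effect (hypothetical type)."""


def wait_for_endpoint_update(
    describe: Callable[[], Dict],
    expected_config_name: str,
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> Dict:
    """Poll until the endpoint settles, then verify the requested config is live."""
    for _ in range(max_polls):
        desc = describe()
        status = desc.get("EndpointStatus")
        if status in TRANSITIONAL:
            time.sleep(poll_seconds)
            continue
        reason = desc.get("FailureReason", "no FailureReason reported")
        if status != "InService":
            raise EndpointUpdateError(f"endpoint ended in status {status}: {reason}")
        actual = desc.get("EndpointConfigName")
        if actual != expected_config_name:
            # The endpoint is healthy but still serving a different configuration.
            raise EndpointUpdateError(
                f"endpoint is InService on {actual!r}, "
                f"expected {expected_config_name!r}: {reason}"
            )
        return desc
    raise TimeoutError(f"endpoint still updating after {max_polls} polls")
```

With boto3, `describe` could be something like `lambda: client.describe_endpoint(EndpointName="my-endpoint-name")`, where `client` is a SageMaker client. Comparing EndpointConfigName assumes that a failed update leaves the previous configuration name in place.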

Actual Behavior

The Terraform plan was applied with no indication of failure, but the requested changes were not made.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

resource "aws_sagemaker_endpoint" "sagemaker_endpoint" {
  # Do not change this - this resource name is used in invocations of the endpoint.
  name                 = "my-endpoint-name"
  # Change this when updating the endpoint configuration
  endpoint_config_name = aws_sagemaker_endpoint_configuration.this_was_changed.name
  tags                 = local.tags

  lifecycle {
    create_before_destroy = true
  }
}

Steps to Reproduce

This cannot easily be reproduced deterministically. It happens when AWS hits a service-side error while updating the endpoint; the error is reported in the FailureReason field of the DescribeEndpoint output:

$ aws sagemaker describe-endpoint --endpoint-name my-endpoint-name --region us-east-1
{
    "EndpointName": "my-endpoint-name",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:MYAWSACCOUNTID:endpoint/my-endpoint-name",
    "EndpointConfigName": "terraform-20240827195426417200000002",
    "ProductionVariants": [
        {
            "VariantName": "default",
            "DeployedImages": [
                {
                    "SpecifiedImage": "...",
                    "ResolvedImage": "...",
                    "ResolutionTime": "2024-08-27T19:20:50.206000-07:00"
                }
            ],
            "CurrentWeight": 1.0,
            "DesiredWeight": 1.0,
            "CurrentInstanceCount": 3,
            "DesiredInstanceCount": 3
        }
    ],
    "EndpointStatus": "InService",
    "FailureReason": "Unable to locate at least 2 availability zone(s) with the requested instance type ml.m5.large that overlap with SageMaker subnets",
    "CreationTime": "2023-11-20T15:43:39.080000-08:00",
    "LastModifiedTime": "2024-08-27T19:33:11.634000-07:00"
}

Note that EndpointStatus is InService despite the presence of a FailureReason, indicating that the pre-existing endpoint configuration remains in place.
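
As a stopgap, the condition described above can be detected from the CLI output itself. The following is a minimal sketch, assuming the JSON comes from `aws sagemaker describe-endpoint` as shown; the function name `silent_failure_reason` is hypothetical, and the sample is an abbreviated copy of the output above.

```python
import json
from typing import Optional


def silent_failure_reason(describe_output: str) -> Optional[str]:
    """Return the FailureReason if the endpoint is InService yet reports a failure."""
    desc = json.loads(describe_output)
    reason = desc.get("FailureReason")
    if desc.get("EndpointStatus") == "InService" and reason:
        return reason
    return None


# Abbreviated copy of the DescribeEndpoint output from this report.
sample = """
{
    "EndpointName": "my-endpoint-name",
    "EndpointStatus": "InService",
    "FailureReason": "Unable to locate at least 2 availability zone(s) with the requested instance type ml.m5.large that overlap with SageMaker subnets"
}
"""

reason = silent_failure_reason(sample)
if reason is not None:
    print(f"endpoint update may have failed silently: {reason}")
```

A FailureReason may also be left over from an earlier failed update, so a non-empty result is a prompt to investigate, not proof that the most recent apply failed.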

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

github-actions[bot] commented 3 days ago

Warning: This issue has been closed, meaning that any additional comments are hard for our team to see. Please assume that the maintainers will not see them.

Ongoing conversations amongst community members are welcome, however, the issue will be locked after 30 days. Moving conversations to another venue, such as the AWS Provider forum, is recommended. If you have additional concerns, please open a new issue, referencing this one where needed.

github-actions[bot] commented 2 days ago

This functionality has been released in v5.67.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!