hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: AWS Sagemaker Endpoint Config "routing_config"."routing_strategy" should not be optional if instance_count > 1 #38352

Open njbrake opened 1 month ago

njbrake commented 1 month ago

Terraform Core Version

1.9.2

AWS Provider Version

5.58.0

Affected Resource(s)

Expected Behavior

When scaling a production variant up to more than one instance, we would expect some form of load balancing even without specifying a routing strategy, since that variable is marked as optional.

Actual Behavior

When we scaled up to more than one instance without specifying a routing strategy, we found that traffic was being duplicated to each instance instead of being load balanced. In other words, one request would be routed to both instance one and instance two. Without a routing strategy specified, it's unclear which routing strategy AWS actually uses, since AWS says that the parameter is required.

Relevant Error/Panic Output Snippet

This is the parameter marked as optional: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sagemaker_endpoint_configuration#routing_strategy

But AWS docs say that it is not optional: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariantRoutingConfig.html

Terraform Configuration Files

N/A

Steps to Reproduce

  1. Spin up an endpoint with an instance count of 2 and no routing_config specified.
  2. Make a request to the endpoint and observe that no load balancing happens: the request is duplicated and sent to both instance 1 and instance 2.
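
A minimal configuration matching step 1 might look like the sketch below. The resource name, variant name, and instance type are placeholders, and aws_sagemaker_model.example is assumed to be defined elsewhere; this is not the exact configuration from our environment.

```hcl
# Sketch only: names and instance type are placeholders.
# aws_sagemaker_model.example is assumed to exist elsewhere in the configuration.
resource "aws_sagemaker_endpoint_configuration" "example" {
  name = "example-endpoint-config"

  production_variants {
    variant_name           = "variant-1"
    model_name             = aws_sagemaker_model.example.name
    initial_instance_count = 2
    instance_type          = "ml.t2.medium"

    # No routing_config block is set here, so whatever routing behavior
    # the endpoint uses is left to the SageMaker side.
  }
}
```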

Debug Output

N/A

Panic Output

N/A

Important Factoids

I think this bug could probably be addressed by determining what configuration value is sent to AWS when it's not specified in the Terraform code. Since AWS says that the parameter is required, I assume that something is being set and we just aren't aware of it. To make matters more confusing, the AWS console doesn't display the routing config parameters as part of the endpoint configuration, so I haven't been able to figure out what the value is when we don't set it. I think it's a combination of missing documentation on the AWS side, which has led to some unclear documentation on the Terraform side as well.

References

No response

Would you like to implement a fix?

None

justinretzolk commented 1 month ago

Hey @njbrake 👋 Thank you for taking the time to raise this! While triaging, I took a quick look through the linked documentation, and wanted to clarify one bit of information. The routing_config block as a whole is marked as optional in the AWS Provider documentation, while the routing_strategy is marked as required. This indicates that if the (optional) routing_config block is supplied, the routing_strategy argument must also be supplied.

This is consistent with the SageMaker API's CreateEndpointConfig, which has the RoutingConfig marked as optional, but if it's supplied, RoutingStrategy is required.
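
To illustrate the distinction, here is a rough sketch of a production variant that supplies the optional routing_config block, which in turn makes routing_strategy mandatory. The names and instance type are placeholders, aws_sagemaker_model.example is assumed to be defined elsewhere, and the strategy shown is just one of the two values the provider documentation lists (LEAST_OUTSTANDING_REQUESTS and RANDOM).

```hcl
# Sketch only: names and instance type are placeholders.
# aws_sagemaker_model.example is assumed to exist elsewhere in the configuration.
resource "aws_sagemaker_endpoint_configuration" "example" {
  name = "example-endpoint-config"

  production_variants {
    variant_name           = "variant-1"
    model_name             = aws_sagemaker_model.example.name
    initial_instance_count = 2
    instance_type          = "ml.t2.medium"

    # The block itself is optional; once it is present,
    # routing_strategy must be set.
    routing_config {
      routing_strategy = "LEAST_OUTSTANDING_REQUESTS"
    }
  }
}
```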

Unfortunately I wasn't able to find a clear answer for the overall issue either, so I'll leave this open for someone on the team or community to weigh in on.

njbrake commented 1 month ago

Thank you for the quick response and triaging! Yes, my concern is that I think routing_config needs to be required whenever the instance count is set above 1. It may also be an issue with the SageMaker documentation.
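
Until that is settled, one possible user-side workaround is to add the block conditionally based on the instance count. This is only a sketch: var.instance_count is a hypothetical variable, the names and instance type are placeholders, aws_sagemaker_model.example is assumed to be defined elsewhere, and I have not confirmed that setting the strategy explicitly changes the duplication behavior described above.

```hcl
# Hypothetical variable used only for this illustration.
variable "instance_count" {
  type    = number
  default = 2
}

resource "aws_sagemaker_endpoint_configuration" "example" {
  name = "example-endpoint-config"

  production_variants {
    variant_name           = "variant-1"
    model_name             = aws_sagemaker_model.example.name
    initial_instance_count = var.instance_count
    instance_type          = "ml.t2.medium"

    # Emit routing_config only when more than one instance is requested.
    dynamic "routing_config" {
      for_each = var.instance_count > 1 ? [1] : []
      content {
        routing_strategy = "LEAST_OUTSTANDING_REQUESTS"
      }
    }
  }
}
```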