hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

terraform sagemaker endpoint creation failed!!! #32734

Open arpita497 opened 1 year ago

arpita497 commented 1 year ago

Terraform Core Version

1.3.4

AWS Provider Version

5.10.0

Affected Resource(s)

resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "mynewendpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.endpoint_config.name
}

resource "aws_sagemaker_endpoint_configuration" "endpoint_config" {
  name = "endpointconfig1"

  production_variants {
    variant_name           = "variant1"
    model_name             = aws_sagemaker_model.model.name
    initial_instance_count = 1
    instance_type          = "ml.m4.xlarge"
  }
}

resource "aws_sagemaker_model" "model" {
  name               = "model1"
  execution_role_arn = aws_iam_role.example.arn

  primary_container {
    image = data.aws_sagemaker_prebuilt_ecr_image.test.registry_path
  }
}

data "aws_sagemaker_prebuilt_ecr_image" "test" {
  repository_name = "knn"
}

resource "aws_iam_role" "example" {
  name               = "role2"
  path               = "/"
  assume_role_policy = data.aws_iam_policy_document.example.json
}

data "aws_iam_policy_document" "example" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}

Expected Behavior

The SageMaker endpoint should be created.

Actual Behavior

Error: waiting for SageMaker Endpoint (mynewendpoint) to be in service: ResourceNotReady: failed waiting for successful resource state

CloudWatch logs:

Customer Error: Unable to load model (caused by PlatformError)
Caused by: No files were found in /opt/ml/model
  File "/opt/amazon/lib/python3.7/site-packages/ai_algorithms_sdk/base/exceptions.py", line 89, in raise_with_traceback
    raise exception.with_traceback(traceback)
Returning bad request ping response due to setup issues.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

The configuration files are included in the description above.

Steps to Reproduce

Running terraform plan succeeds. On terraform apply, the SageMaker model and endpoint configuration are created, but creation of the endpoint fails.

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

arpita497 commented 1 year ago

@teamterraform @terraformbot @kvedangamazon @aws-mkandylis

justinretzolk commented 1 year ago

Hey @arpita497 👋 Thank you for taking the time to raise this! In looking at this, with the Cloudwatch log snippet that you supplied, my initial feeling is that this is likely due to an issue with the configuration of the model, but I can't say for certain. Are you able to supply debug logs (redacted as needed) so that we have that information as well?

arpita497 commented 1 year ago

[Screenshots attached: error1, error2]

justinretzolk commented 1 year ago

Hey @arpita497 👋 Apologies, I should have been more specific. Can you supply debug logs from Terraform so that we can take a look and see if there's anything that would indicate that this was a bug within the AWS Provider? Based on the screenshots you provided, I still suspect that this is an issue with the model, but want to make sure and entirely eliminate the possibility that something unexpected is happening on the Terraform side of things.
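
For illustration, here is a minimal sketch of what a model-side fix might look like, assuming the "No files were found in /opt/ml/model" message means the model was created without a trained artifact. The configuration in the issue sets only image on primary_container; the sketch adds model_data_url. The bucket name, object key, and the model_artifact_read names are hypothetical placeholders and do not come from the issue. The execution role may need further permissions beyond the one shown.

```hcl
# Sketch only: the bucket name and object key below are hypothetical placeholders.
resource "aws_sagemaker_model" "model" {
  name               = "model1"
  execution_role_arn = aws_iam_role.example.arn

  primary_container {
    image = data.aws_sagemaker_prebuilt_ecr_image.test.registry_path

    # Assumption: a trained k-NN artifact (model.tar.gz) already exists at this
    # S3 location. SageMaker unpacks it into /opt/ml/model in the container.
    model_data_url = "s3://my-model-bucket/knn/output/model.tar.gz"
  }
}

# Assumption: the execution role needs read access to that artifact.
data "aws_iam_policy_document" "model_artifact_read" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::my-model-bucket/knn/output/model.tar.gz"]
  }
}

resource "aws_iam_role_policy" "model_artifact_read" {
  name   = "model-artifact-read"
  role   = aws_iam_role.example.id
  policy = data.aws_iam_policy_document.model_artifact_read.json
}
```

If the endpoint still fails with an artifact in place, Terraform debug logs would help separate a provider problem from a model problem.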

justinretzolk commented 2 weeks ago

Hey @arpita497 👋 I wanted to check in here to see if you were still experiencing this, or if it's safe for us to close this issue out.