hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: Updates to aws_sagemaker_domain and aws_sagemaker_user_profile recreate the domain and we lose access to the existing EFS server and files within SageMaker Studio #29331

Open rubencg195 opened 1 year ago

rubencg195 commented 1 year ago

Terraform Core Version

0.13.7

AWS Provider Version

4.53.0

Affected Resource(s)

Whenever we apply an update to aws_sagemaker_domain and aws_sagemaker_user_profile, whether a networking change (such as updating the security group rules) or another kind of change (such as updating the Jupyter image ARN), Terraform recreates the domain. The new domain gets a new EFS file system, and we lose access to the previous one and to the files stored on it.

The AWS representatives said they can't do anything about it and recommended reporting the issue to HashiCorp. In the meantime, they suggested using their guides to back up the EFS data to S3 and using EC2 instances to mount both EFS file systems and move the data from the old one to the new one. That is a lot of manual work, and it could be avoided by adding an option to aws_sagemaker_domain and aws_sagemaker_user_profile to specify an existing EFS ID instead of creating a new one.

Expected Behavior

The domain should keep its reference to the existing EFS file system, not create a new one, and not lose the files that appear on the SageMaker Studio filesystem. Please add an option to the aws_sagemaker_domain and aws_sagemaker_user_profile resources to specify an existing EFS ID instead of creating a new one.

Actual Behavior

The domain is recreated, reference to the existing EFS with the files is lost, and the files in the SageMaker Studio filesystem are wiped.

Relevant Error/Panic Output Snippet

No error, files just do not appear on the SageMaker Studio filesystem after an update.

Terraform Configuration Files

terraform {
  required_version = "0.13.7"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~>4.53.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

Steps to Reproduce

A basic example of a change that triggers recreation of the domain and user profile is updating the Jupyter version or changing the security group rules.

Before

resource "aws_sagemaker_domain" "sagemaker_domain" {
  domain_name             = var.domain_name
  auth_mode               = var.auth_mode
  vpc_id                  = var.vpc_id
  subnet_ids              = var.subnet_ids
  kms_key_id              = var.kms_key_id
  app_network_access_type = "VpcOnly"
  tags                    = var.tags

  default_user_settings {
    execution_role  = var.execution_role_arn
    security_groups = [
      aws_security_group.domain_sg.id
    ]
  }
}

After

resource "aws_sagemaker_domain" "sagemaker_domain" {
  domain_name             = var.domain_name
  auth_mode               = var.auth_mode
  vpc_id                  = var.vpc_id
  subnet_ids              = var.subnet_ids
  kms_key_id              = var.kms_key_id
  app_network_access_type = "VpcOnly"
  tags                    = var.tags

  default_user_settings {
    execution_role  = var.execution_role_arn
    security_groups = [
      aws_security_group.domain_sg.id
    ]
    jupyter_server_app_settings {
      default_resource_spec {
        instance_type       = "system"
        sagemaker_image_arn = var.jupyter_server_image_arn
      }
    }
  }
}
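
As a stopgap while the replacement behavior is unresolved, Terraform's `lifecycle` meta-argument with `prevent_destroy` can make a plan fail instead of silently destroying the domain. The following is only a sketch, assuming the same variables and security group as the configuration above. The `lifecycle` block is an addition for illustration and is not part of the original report. It does not remove the need for replacement, it only blocks the destroy so the change can be reviewed first.

```hcl
resource "aws_sagemaker_domain" "sagemaker_domain" {
  domain_name             = var.domain_name
  auth_mode               = var.auth_mode
  vpc_id                  = var.vpc_id
  subnet_ids              = var.subnet_ids
  kms_key_id              = var.kms_key_id
  app_network_access_type = "VpcOnly"
  tags                    = var.tags

  default_user_settings {
    execution_role  = var.execution_role_arn
    security_groups = [
      aws_security_group.domain_sg.id
    ]
  }

  # Assumption: added for illustration, not in the original configuration.
  # Terraform rejects any plan that would destroy this resource,
  # which includes the destroy step of a replacement.
  lifecycle {
    prevent_destroy = true
  }
}
```

With this in place, a change that forces replacement should surface as a plan error rather than as a new domain with an empty filesystem.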

Debug Output

N/A. Files just do not appear on the SageMaker filesystem after an update.

Panic Output

N/A. Files just do not appear on the SageMaker filesystem after an update.

Important Factoids

Please add an option to the aws_sagemaker_domain and aws_sagemaker_user_profile resources to specify an existing EFS ID instead of creating a new one.

References

No response

Would you like to implement a fix?

None

rubencg195 commented 1 year ago

Hi team, are there any updates on this? Thanks.

DrFaust92 commented 1 year ago

Hi @rubencg195, as you can see in the AWS docs for the SageMaker domain, https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html, there is no option to reference an existing EFS file system. You can retain the implicitly created file system using https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sagemaker_domain#retention_policy (I'm not sure how the data behaves, but as noted above this is on the AWS API side, and the provider doesn't explicitly delete anything).

Still, when the domain is recreated, it cannot use an existing EFS file system. This is an AWS limitation.

MarvinBeGood commented 1 year ago

I think that to fix this problem you need to add the following block to your Terraform code:

  retention_policy {
    home_efs_file_system = "Retain"
  }

The default value does not seem to take effect in Terraform, so you need to set it explicitly.
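
For context, here is a sketch of where that block would sit, assuming the domain resource and variables from the original report. `retention_policy` is a top-level block of `aws_sagemaker_domain`, a sibling of `default_user_settings`. Per the earlier comment in this thread, this should only keep the old EFS file system from being deleted along with the domain. It does not attach that file system to a newly created domain.

```hcl
resource "aws_sagemaker_domain" "sagemaker_domain" {
  domain_name             = var.domain_name
  auth_mode               = var.auth_mode
  vpc_id                  = var.vpc_id
  subnet_ids              = var.subnet_ids
  kms_key_id              = var.kms_key_id
  app_network_access_type = "VpcOnly"
  tags                    = var.tags

  default_user_settings {
    execution_role  = var.execution_role_arn
    security_groups = [
      aws_security_group.domain_sg.id
    ]
  }

  # Keep the implicitly created EFS file system when the domain is deleted.
  # Set explicitly rather than relying on the default.
  retention_policy {
    home_efs_file_system = "Retain"
  }
}
```

After a replacement, the retained file system would still need to be mounted and copied manually, as described in the original report.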