pulumi / pulumi-aws

An Amazon Web Services (AWS) Pulumi resource package, providing multi-language access to AWS
Apache License 2.0

Creating monitoring schedule returns error during operation 'READ' #2419

Open TheArvinLim opened 1 year ago

TheArvinLim commented 1 year ago

Pulumi returns a READ error and crashes after creating a monitoring schedule.

Expected behavior

The monitoring schedule is created, no READ error occurs, and the Pulumi program finishes without errors.

Current behavior

After creating a data quality job definition and passing it into a monitoring schedule, Pulumi returns the following error:

GeneralServiceException: AWS::SageMaker::MonitoringSchedule Handler returned status FAILED: Error occurred during operation 'READ'. 
(HandlerErrorCode: GeneralServiceException, RequestToken: 1a7a4950-9234-4108-913d-0543e623defe)

This occurs after the monitoring schedule is created: the schedule shows up as successfully created in SageMaker Studio, but immediately afterwards Pulumi throws the READ error and crashes.

Steps to reproduce

  1. Define job definition as follows:
    data_quality_job_definition = aws_native.sagemaker.DataQualityJobDefinition(
        resource_name=f"{workspace_name}-data-quality-job",
        job_definition_name=f"{workspace_name}-data-quality-job",
        endpoint_name=endpoint.name,
        data_quality_app_specification=aws_native.sagemaker.DataQualityJobDefinitionDataQualityAppSpecificationArgs(
            image_uri="245545462676.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-model-monitor-analyzer:latest"
        ),
        data_quality_job_input=aws_native.sagemaker.DataQualityJobDefinitionDataQualityJobInputArgs(
            endpoint_input=aws_native.sagemaker.DataQualityJobDefinitionEndpointInputArgs(
                endpoint_name=endpoint.name,
                local_path="/opt/ml/processing/endpointdata",
            )
        ),
        data_quality_job_output_config=aws_native.sagemaker.DataQualityJobDefinitionMonitoringOutputConfigArgs(
            monitoring_outputs=[
                aws_native.sagemaker.DataQualityJobDefinitionMonitoringOutputArgs(
                    s3_output=aws_native.sagemaker.DataQualityJobDefinitionS3OutputArgs(
                        local_path="/opt/ml/processing/localpath",
                        s3_uri=f"s3://{workspace_name}-endpoint-bucket/data_quality_monitor_output",
                    )
                ),
            ]
        ),
        job_resources=aws_native.sagemaker.DataQualityJobDefinitionMonitoringResourcesArgs(
            cluster_config=aws_native.sagemaker.DataQualityJobDefinitionClusterConfigArgs(
                instance_count=1,
                instance_type="ml.t3.medium",
                volume_size_in_gb=1,
            )
        ),
        role_arn=sagemaker_full_access_role.arn,
    )
  2. Define monitoring schedule as follows:
    data_quality_monitoring_schedule = aws_native.sagemaker.MonitoringSchedule(
        resource_name=f"{workspace_name}-data-quality-monitoring-schedule",
        endpoint_name=endpoint.name,
        monitoring_schedule_name=f"{workspace_name}-data-quality-monitoring-schedule",
        monitoring_schedule_config=aws_native.sagemaker.MonitoringScheduleConfigArgs(
            monitoring_job_definition_name=data_quality_job_definition.job_definition_name,
            monitoring_type=aws_native.sagemaker.MonitoringScheduleMonitoringType.DATA_QUALITY,
            schedule_config=aws_native.sagemaker.MonitoringScheduleScheduleConfigArgs(
                schedule_expression="cron(0 0 ? * * *)",  # daily
            ),
        ),
    )
  3. Run pulumi up. The update fails with the following error:
    +  aws-native:sagemaker:MonitoringSchedule test-data-quality-monitoring-schedule **creating failed** error: reading resource state: reading resource 
    state: operation error CloudControl: GetResource, https response error StatusCode: 400, RequestID: 09f8dc65-4c3f-494e-873d-b8dc0d6245dc, 
    GeneralServiceException: AWS::SageMaker::MonitoringSchedule Handler returned status FAILED: Error occurred during operation 'READ'. 

Context (Environment)

We are trying to create resources that are used for model deployment on SageMaker. Part of this is taking a model in the model registry, deploying it to an endpoint, and then creating data quality / model quality job schedules on that endpoint.

The issue is that Pulumi crashes before finishing, so even though the schedule seems to be created successfully, any resources that should be deployed after it are never deployed.

I have tested the above flow using the AWS CLI, and the definition / schedule seem to be created without any issues. I have also tried using the CLI to describe the definition / schedule, and there are no issues with reading the resource properties. The issue seems to lie with Pulumi.
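For anyone who wants to narrow this down, here is a rough sketch of how the two read paths could be compared side by side: the SageMaker describe call (which works for me) and the Cloud Control GetResource call that appears in the Pulumi error. The helper name compare_reads is made up for illustration, and passing the schedule ARN as the Cloud Control identifier is an assumption on my part, not something confirmed in this thread. The clients are passed in so the function itself does not need AWS credentials to import.

```python
def compare_reads(sagemaker_client, cloudcontrol_client, schedule_name):
    """Read one monitoring schedule via SageMaker and via Cloud Control.

    Hypothetical helper: the clients are expected to behave like
    boto3.client("sagemaker") and boto3.client("cloudcontrol").
    """
    # Direct SageMaker read, the same call the AWS CLI "describe" uses.
    described = sagemaker_client.describe_monitoring_schedule(
        MonitoringScheduleName=schedule_name
    )
    arn = described["MonitoringScheduleArn"]

    result = {
        "arn": arn,
        "status": described.get("MonitoringScheduleStatus"),
        "cloudcontrol_error": None,
    }

    # Cloud Control read, the operation named in the Pulumi error.
    # Assumption: the resource is identified by its ARN.
    try:
        cloudcontrol_client.get_resource(
            TypeName="AWS::SageMaker::MonitoringSchedule",
            Identifier=arn,
        )
    except Exception as exc:  # with boto3 this would be a ClientError
        result["cloudcontrol_error"] = str(exc)

    return result


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   print(compare_reads(
#       boto3.client("sagemaker"),
#       boto3.client("cloudcontrol"),
#       "test-data-quality-monitoring-schedule",
#   ))
```

If the SageMaker read succeeds while cloudcontrol_error is set, that would suggest the failure sits in the Cloud Control READ handler rather than in the schedule itself.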

thomas11 commented 1 year ago

Hi @TheArvinLim, thanks for reporting this! Would you be able to help me complete your code sample to reproduce the issue? What are endpoint and sagemaker_full_access_role?

Also, the output of pulumi about would be very helpful to determine all versions.

TheArvinLim commented 1 year ago

Hi @thomas11, thanks for the quick reply!

This is the output of pulumi about:

CLI          
Version      3.48.0
Go Version   go1.19.2
Go Compiler  gc

Host     
OS       debian
Version  11.6
Arch     aarch64

Definition of sagemaker_full_access_role:

import json

import pulumi_aws as aws

# IAM role that SageMaker assumes when running the monitoring job.
sagemaker_full_access_role = aws.iam.Role(
    resource_name="sagemaker-full-access",
    name="sagemaker-full-access",
    assume_role_policy=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "",
                    "Effect": "Allow",
                    "Principal": {"Service": "sagemaker.amazonaws.com"},
                    "Action": "sts:AssumeRole",
                }
            ],
        }
    ),
    managed_policy_arns=[
        "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    ],
)

Definition of endpoint:

s3_uri = f"s3://{workspace_name}-endpoint-bucket/endpoint-data-capture-logs/"
endpoint_configuration = aws.sagemaker.EndpointConfiguration(
    resource_name="test_endpoint",
    name="test_endpoint",
    data_capture_config=aws.sagemaker.EndpointConfigurationDataCaptureConfigArgs(
        destination_s3_uri=s3_uri,
        initial_sampling_percentage=100,
        enable_capture=True,
        capture_options=[
            aws.sagemaker.EndpointConfigurationDataCaptureConfigCaptureOptionArgs(capture_mode="Output"),
            aws.sagemaker.EndpointConfigurationDataCaptureConfigCaptureOptionArgs(capture_mode="Input"),
        ],
        capture_content_type_header=aws.sagemaker.EndpointConfigurationDataCaptureConfigCaptureContentTypeHeaderArgs(
            csv_content_types=["text/csv"], json_content_types=["application/json"]
        ),
    ),
    production_variants=[
        aws.sagemaker.EndpointConfigurationProductionVariantArgs(
            variant_name="test_variant",
            model_name="test_model",
            initial_instance_count=1,
            instance_type="ml.m5.xlarge",
        )
    ],
)
endpoint = aws.sagemaker.Endpoint(
    resource_name="test_model",
    name="test_model",
    endpoint_config_name=endpoint_configuration.id,
)