opensearch-project / opensearch-catalog

The OpenSearch Catalog is designed to make it easier for developers and the community to contribute, search for, and install artifacts such as plugins, visualization dashboards, and ingestion-to-visualization content packs (data pipeline configurations, normalization, ingestion, and dashboards).
Apache License 2.0

[FEATURE] Add removal policies for the tables created by integrations #158

Open · Jon-AtAWS opened this issue 3 weeks ago

Jon-AtAWS commented 3 weeks ago

Is your feature request related to a problem? No

What solution would you like? If I use an integration, it creates various tables in GDC (the AWS Glue Data Catalog). When I delete the integration, OpenSearch does not delete those tables. While this is the right behavior in many scenarios, in others I want to delete all associated resources, including the tables.

Can we add a "removal policy" to the integration that lets me specify what OpenSearch should do when the integration is deleted? The AWS Cloud Development Kit (CDK) has a RemovalPolicy that standardizes how the generated template handles deletes for AWS resources. For example, I can create an S3 bucket with this Python code:

    import aws_cdk as cdk
    from aws_cdk import aws_s3 as s3

    # Inside a cdk.Stack subclass; BUCKET_NAME is defined elsewhere
    s3_bucket = s3.Bucket(self, "MyGreatS3Bucket",
        block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        bucket_name=BUCKET_NAME,
        enforce_ssl=True,
        versioned=True,
        # Delete the bucket when the stack is deleted
        removal_policy=cdk.RemovalPolicy.DESTROY,
        # Empty the bucket first so deletion can succeed
        auto_delete_objects=True,
    )

cdk.RemovalPolicy.DESTROY causes CDK to generate the following resource in the CloudFormation template:

  MyGreatBucket:
    Type: AWS::S3::Bucket
    Properties:
      # ...
    UpdateReplacePolicy: Delete
    DeletionPolicy: Delete

Other choices are RemovalPolicy.SNAPSHOT and RemovalPolicy.RETAIN. This seems like a good way for me to specify what I want to happen.
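As a purely illustrative sketch (none of these names exist in OpenSearch today), an integration-level equivalent might be an enum along these lines:

    from enum import Enum

    # Hypothetical CDK-style removal policy for integrations
    class RemovalPolicy(Enum):
        DESTROY = "destroy"    # drop every table/view the integration created
        RETAIN = "retain"      # today's behavior: leave GDC objects in place
        SNAPSHOT = "snapshot"  # snapshot the data before dropping, if supported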

Swiddis commented 3 weeks ago

Thanks for the request!

I'm thinking about how to go about this. We'd probably need the table and materialized view (MV) information added to the instance object, with deletion steps for each object type, plus some flag to opt in. It might be tricky because we don't necessarily know ahead of time every type of GDC object we'll need to delete (with DROP TABLE, DROP MATERIALIZED VIEW, or other statements).

Maybe adding some sort of glue_params object with a resources array:

"glue_params": {
    "removal_policy": "delete", // Assuming we don't need anything more granular than a global policy enum

    // Insert in install order, probably need to delete in reverse order
    // (does Spark SQL let you drop everything in parallel?)
    "resources": [
        {
            "name": "s3conn.database.table",
            "type": "table"
        },
        {
            "name": "s3conn.database.index",
            "type": "skipping_index"
        }
    ]
}
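To make that concrete, here's a minimal sketch of type-based teardown over that resources array. run_spark_sql is a hypothetical helper, and the statement shape per type is an assumption; figuring out the right DROP for each object type is exactly the open question above.

    # Assumed statement per resource type; incomplete by design,
    # since we don't know every GDC object type up front
    DROP_TEMPLATES = {
        "table": "DROP TABLE IF EXISTS {name}",
        "materialized_view": "DROP MATERIALIZED VIEW IF EXISTS {name}",
        "skipping_index": "DROP SKIPPING INDEX ON {name}",
    }

    def teardown(glue_params, run_spark_sql):
        if glue_params.get("removal_policy") != "delete":
            return  # "retain" and friends: leave everything in place
        # Delete in reverse install order so dependents go before dependencies
        for resource in reversed(glue_params["resources"]):
            template = DROP_TEMPLATES.get(resource["type"])
            if template is None:
                raise ValueError(f"no drop rule for type: {resource['type']}")
            run_spark_sql(template.format(name=resource["name"]))

Failing loudly on an unknown type seems safer than skipping it, since a silent skip would leave orphaned GDC objects behind.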

Then we just need to add a way to specify this policy in the install process, probably as part of building.

Related is the issue of letting integrations drop tables during a failed install when earlier queries have succeeded; being able to do that rollback would also require type-based dropping, but it's tricky because many integrations use CREATE ... IF NOT EXISTS queries, so we can't blindly delete everything we touch.
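A rough sketch of how rollback could sidestep the CREATE IF NOT EXISTS problem: check existence before each query and record only the objects the install actually created. object_exists is a hypothetical helper, and DROP_TEMPLATES is the mapping from the sketch above.

    def install_with_rollback(steps, object_exists, run_spark_sql):
        """steps: list of (name, type, create_sql) in install order."""
        created = []  # only the objects this install actually created
        try:
            for name, rtype, create_sql in steps:
                pre_existing = object_exists(name, rtype)
                run_spark_sql(create_sql)  # often CREATE ... IF NOT EXISTS
                if not pre_existing:
                    created.append((name, rtype))
        except Exception:
            # Roll back in reverse order, touching only what we created
            for name, rtype in reversed(created):
                run_spark_sql(DROP_TEMPLATES[rtype].format(name=name))
            raise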

@YANG-DB can you triage?

Swiddis commented 3 weeks ago

can you triage?

"Fine, I'll do it myself."