Allow specification of sort and dist keys

Limess commented 4 years ago

Allow specification of sort and dist keys on a per-stream basis.

We now occassionally get recommendations from Redshift advisor to add sort/dist keys to columns, or identify columns which would benefit ourselves.

It'd be nice to codeify these into our pipeline rather than running a one-off SQL statement to add them and not capturing this anywhere concrete.

Suggestion:

Respect distribution-key (string) and sort-keys (array) in metadata.metadata[0] in the singer catalogue.

koszti commented 4 years ago

Supporting dist and sort keys makes sense. But sending it as part of the singer catalog maybe not the best option. Singer catalogs should talk only about the source databases and redshift specific parameters should be defined somehow by this target.

Another way of doing it by specifying sort and dist keys as part of the schema_mapping object. Schema mapping is generated automatically by using the tap catalog and the PipelineWise tap YAML files. The input YAML file would look something like this

...
schemas:
  - source_schema: "my_db"        # Source schema (aka. database) in MySQL/ MariaDB with tables
    target_schema: "my_db"        # Target schema in the destination Data Warehouse
    tables:
      - table_name: "table_one"
        replication_method: "LOG_BASED"   # One of INCREMENTAL, LOG_BASED and FULL_TABLE
        sort_key: column_one
        distribution_key: column_two
...

If you use this target-redshift as alone and not from pipelinewise then you'd need to generate the schema_mapping by yourself.

What do you think? Would that work for you?

Limess commented 4 years ago

That'd be great standalone - we already programatically generate our catalogs and configuration so this would be a small extra step and I agree that it doesn't fit into the catalog.

transferwise / pipelinewise-target-redshift

Allow specification of sort and dist keys #39