Open Limess opened 4 years ago
Supporting dist and sort keys makes sense, but sending them as part of the Singer catalog may not be the best option. Singer catalogs should describe only the source databases; Redshift-specific parameters should be defined by this target.
Another way of doing it is to specify sort and dist keys as part of the schema_mapping object. Schema mapping is generated automatically from the tap catalog and the PipelineWise tap YAML files. The input YAML file would look something like this:
...
schemas:
  - source_schema: "my_db"               # Source schema (aka. database) in MySQL/MariaDB with tables
    target_schema: "my_db"               # Target schema in the destination Data Warehouse
    tables:
      - table_name: "table_one"
        replication_method: "LOG_BASED"  # One of INCREMENTAL, LOG_BASED and FULL_TABLE
        sort_key: column_one
        distribution_key: column_two
...
If you use target-redshift on its own and not from PipelineWise, then you'd need to generate the schema_mapping yourself.
What do you think? Would that work for you?
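For standalone use, generating the mapping could be sketched roughly as follows. This is only an illustration of the proposal: the function name build_schema_mapping and the per-table "tables" entry inside schema_mapping are hypothetical, and the input is assumed to be the tap YAML already parsed into a Python dict.

```python
def build_schema_mapping(tap_config):
    """Build a schema_mapping dict from an already-parsed tap YAML structure.

    The "tables" entry is a hypothetical extension for this proposal,
    not an existing target-redshift option.
    """
    mapping = {}
    for schema in tap_config.get("schemas", []):
        tables = {}
        for table in schema.get("tables", []):
            # Copy only the Redshift-specific keys that are actually set.
            keys = {
                name: table[name]
                for name in ("sort_key", "distribution_key")
                if name in table
            }
            if keys:
                tables[table["table_name"]] = keys
        mapping[schema["source_schema"]] = {
            "target_schema": schema["target_schema"],
            "tables": tables,
        }
    return mapping


# Same content as the YAML example above, as a parsed dict.
tap_config = {
    "schemas": [
        {
            "source_schema": "my_db",
            "target_schema": "my_db",
            "tables": [
                {
                    "table_name": "table_one",
                    "replication_method": "LOG_BASED",
                    "sort_key": "column_one",
                    "distribution_key": "column_two",
                }
            ],
        }
    ]
}

schema_mapping = build_schema_mapping(tap_config)
```

Tables without either key are left out of the mapping, so the target could keep its current behaviour for them.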
That'd be great standalone - we already programmatically generate our catalogs and configuration, so this would be a small extra step, and I agree that it doesn't fit into the catalog.
Allow specification of sort and dist keys on a per-stream basis.
We now occasionally get recommendations from Redshift Advisor to add sort/dist keys to columns, or identify columns ourselves which would benefit from them.
It'd be nice to codify these in our pipeline rather than running a one-off SQL statement to add them and not capturing this anywhere concrete.
Suggestion:
Respect distribution-key (string) and sort-keys (array) in metadata.metadata[0] in the Singer catalogue.
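A minimal sketch of how a target might honour this suggestion, assuming the stream-level entry is the first item of the stream's metadata list, as the suggestion implies. The helper names stream_keys, quote and table_attributes are hypothetical and not part of target-redshift; the output uses Redshift's DISTKEY and SORTKEY table attributes.

```python
def stream_keys(catalog_entry):
    """Read the proposed keys from metadata[0]["metadata"] of a stream."""
    metadata = catalog_entry.get("metadata") or []
    if not metadata:
        return None, []
    props = metadata[0].get("metadata", {})
    return props.get("distribution-key"), list(props.get("sort-keys", []))


def quote(name):
    """Quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def table_attributes(dist_key, sort_keys):
    """Build the attribute clause to append to a CREATE TABLE statement."""
    parts = []
    if dist_key:
        parts.append("DISTKEY(" + quote(dist_key) + ")")
    if sort_keys:
        parts.append("SORTKEY(" + ", ".join(quote(k) for k in sort_keys) + ")")
    return " ".join(parts)


# Hypothetical catalog entry carrying the proposed metadata keys.
entry = {
    "stream": "table_one",
    "metadata": [
        {
            "breadcrumb": [],
            "metadata": {
                "distribution-key": "column_two",
                "sort-keys": ["column_one"],
            },
        }
    ],
}

dist_key, sort_keys = stream_keys(entry)
clause = table_attributes(dist_key, sort_keys)
```

Streams that carry neither key produce an empty clause, so existing catalogs would be unaffected.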