transferwise / pipelinewise-target-redshift

Singer.io Target for Amazon Redshift - PipelineWise compatible
https://transferwise.github.io/pipelinewise/
Other
12 stars 65 forks source link

Why is COMPUPDATE explicitly disabled #16

Closed Limess closed 5 years ago

Limess commented 5 years ago

General question as to why COMPUPDATE is off in the Redshift COPY command.

After running some tables from postgres to pipelinewise-target-redshift I've checked the result compression using ANALYZE_COMPRESSION with this output:

Table Column Encoding Est_reduction_pct
raw_table zstd 89.54
raw_table zstd 89.54
raw_table zstd 89.08
raw_table zstd 88.84
raw_table zstd 88.36
raw_table zstd 88.16
raw_table zstd 88.04
raw_table zstd 83.90
raw_table zstd 75.15
raw_table zstd 72.93
raw_table zstd 67.32
raw_table zstd 64.90
raw_table zstd 63.35
raw_table zstd 61.31
raw_table zstd 55.30
raw_table zstd 54.52
raw_table zstd 53.87
raw_table zstd 50.25
raw_table zstd 48.17
raw_table zstd 47.91
raw_table zstd 44.15
raw_table zstd 42.95
raw_table zstd 42.73
raw_table zstd 40.92
raw_table zstd 39.82
raw_table zstd 38.98
raw_table zstd 38.87
raw_table zstd 38.87
raw_table zstd 35.16
raw_table zstd 34.74
raw_table zstd 34.58
raw_table zstd 34.56
raw_table zstd 21.42
raw_table zstd 20.73
raw_table zstd 20.65
raw_table _sdc_deleted_at lzo 0.00
raw_table _sdc_extracted_at raw 0.00
raw_table raw 0.00

I'm unsure if enabling COMPUPDATE to apply compression on the staging table would then copy encodings across to the final table or whether they're discarded hence it being disabled here? Does enabling compression slow down loads?

Is there another alternate way of applying compression to the final raw tables generated by this target?

Limess commented 5 years ago

Ran a quick test with COMPUPDATE on based off a fork and there was no difference to the output of ANALYZE compression for the test tables.

koszti commented 5 years ago

The original idea of adding COMPUPDATE OFF is based on a not too recent blog post at https://www.flydata.com/blog/how-to-improve-performance-upsert-amazon-redshift/ and maybe it's outdated.

At the moment I'm working on https://github.com/transferwise/pipelinewise-target-redshift/issues/8 that will give an option to overwrite the built-in COPY options. Do you think that would be helpful for you as well so you can add/remove extra options to COPY as you wish?

Limess commented 5 years ago

Thanks for the info and reference

I think https://github.com/transferwise/pipelinewise-target-redshift/issues/8 would be sufficient for further tuning or experimentation regarding this issue.

Limess commented 5 years ago

Reading more into this I think this makes complete sense for staging tables anyway, what we really want is the initial table sync to not use a staging table and do a direct copy with COMPUPDATE unset. This is probably not a priority for this library at this time as the initial sync when using Pipelinewise directly is fast-sync.

Longer term we may look at adding this here or switching to using pipelinewise rather than this target independently.

koszti commented 5 years ago

The latest 1.0.7 version released to PyPI and gives the option to override the default COPY commands and to remove COMPUPDATE OFF. You need to add the copy_option key to the config.json and add the values only that you specifically need. Doc is at https://github.com/transferwise/pipelinewise-target-redshift#configuration-settings

Like you said this library primarily used to load data to staging and leaving COMPUPDATE OFF probably makes sense and COMPUPDATE is unset in pipelinewise fast-sync by default.