Why is COMPUPDATE explicitly disabled

Limess commented 5 years ago

General question as to why COMPUPDATE is off in the Redshift COPY command.

After running some tables from postgres to pipelinewise-target-redshift I've checked the result compression using ANALYZE_COMPRESSION with this output:

Table	Column	Encoding	Est_reduction_pct
raw_table		zstd	89.54
raw_table		zstd	89.54
raw_table		zstd	89.08
raw_table		zstd	88.84
raw_table		zstd	88.36
raw_table		zstd	88.16
raw_table		zstd	88.04
raw_table		zstd	83.90
raw_table		zstd	75.15
raw_table		zstd	72.93
raw_table		zstd	67.32
raw_table		zstd	64.90
raw_table		zstd	63.35
raw_table		zstd	61.31
raw_table		zstd	55.30
raw_table		zstd	54.52
raw_table		zstd	53.87
raw_table		zstd	50.25
raw_table		zstd	48.17
raw_table		zstd	47.91
raw_table		zstd	44.15
raw_table		zstd	42.95
raw_table		zstd	42.73
raw_table		zstd	40.92
raw_table		zstd	39.82
raw_table		zstd	38.98
raw_table		zstd	38.87
raw_table		zstd	38.87
raw_table		zstd	35.16
raw_table		zstd	34.74
raw_table		zstd	34.58
raw_table		zstd	34.56
raw_table		zstd	21.42
raw_table		zstd	20.73
raw_table		zstd	20.65
raw_table	_sdc_deleted_at	lzo	0.00
raw_table	_sdc_extracted_at	raw	0.00
raw_table		raw	0.00

I'm unsure if enabling COMPUPDATE to apply compression on the staging table would then copy encodings across to the final table or whether they're discarded hence it being disabled here? Does enabling compression slow down loads?

Is there another alternate way of applying compression to the final raw tables generated by this target?

Limess commented 5 years ago

Ran a quick test with COMPUPDATE on based off a fork and there was no difference to the output of ANALYZE compression for the test tables.

koszti commented 5 years ago

The original idea of adding COMPUPDATE OFF is based on a not too recent blog post at https://www.flydata.com/blog/how-to-improve-performance-upsert-amazon-redshift/ and maybe it's outdated.

At the moment I'm working on https://github.com/transferwise/pipelinewise-target-redshift/issues/8 that will give an option to overwrite the built-in COPY options. Do you think that would be helpful for you as well so you can add/remove extra options to COPY as you wish?

Limess commented 5 years ago

Thanks for the info and reference

I think https://github.com/transferwise/pipelinewise-target-redshift/issues/8 would be sufficient for further tuning or experimentation regarding this issue.

Limess commented 5 years ago

Reading more into this I think this makes complete sense for staging tables anyway, what we really want is the initial table sync to not use a staging table and do a direct copy with COMPUPDATE unset. This is probably not a priority for this library at this time as the initial sync when using Pipelinewise directly is fast-sync.

Longer term we may look at adding this here or switching to using pipelinewise rather than this target independently.

koszti commented 5 years ago

The latest 1.0.7 version released to PyPI and gives the option to override the default COPY commands and to remove COMPUPDATE OFF. You need to add the copy_option key to the config.json and add the values only that you specifically need. Doc is at https://github.com/transferwise/pipelinewise-target-redshift#configuration-settings

Like you said this library primarily used to load data to staging and leaving COMPUPDATE OFF probably makes sense and COMPUPDATE is unset in pipelinewise fast-sync by default.

transferwise / pipelinewise-target-redshift

Why is COMPUPDATE explicitly disabled #16