transferwise / pipelinewise-target-redshift

Singer.io Target for Amazon Redshift - PipelineWise compatible
https://transferwise.github.io/pipelinewise/

Ability for custom configurations in the COPY statement #8

Closed mlavoie-sm360 closed 5 years ago

mlavoie-sm360 commented 5 years ago

Describe the feature

Provide a way to customize the data format and conversion parameters of the COPY statement in the config file.

---

# ------------------------------------------------------------------------------
# General Properties
# ------------------------------------------------------------------------------
id: "redshift"                        # Unique identifier of the target
name: "Amazon Redshift"               # Name of the target
type: "target-redshift"               # !! THIS SHOULD NOT CHANGE !!

# ------------------------------------------------------------------------------
# Target - Data Warehouse connection details
# ------------------------------------------------------------------------------
db_conn:
  host: "xxxxx.redshift.amazonaws.com"          # Redshift host
  port: 5439                                    # Redshift port
  user: "<USER>"                                # Redshift user
  password: "<PASSWORD>"                        # Plain string or vault encrypted
  dbname: "<DB_NAME>"                           # Redshift database name

  # We use an intermediate S3 to load data into Redshift
  aws_access_key_id: "<ACCESS_KEY>"             # S3 - Plain string or vault encrypted
  aws_secret_access_key: "<SECRET_ACCESS_KEY>"  # S3 - Plain string or vault encrypted
  s3_bucket: "<BUCKET_NAME>"                    # S3 external bucket name
  s3_key_prefix: "redshift-imports/"            # Optional: S3 key prefix

  # COPY options
  copy_options:                                 # COPY formatting and conversion options
    - "DELIMITER ',' REMOVEQUOTES ESCAPE"
    - "BLANKSASNULL TIMEFORMAT 'auto'"
    - "COMPUPDATE OFF STATUPDATE OFF"
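A minimal sketch of how the target could assemble the COPY statement from such a copy_options list (the function name and defaults here are hypothetical and do not reflect the actual db_sync.py implementation):

```python
# Hypothetical sketch: build a Redshift COPY statement from configured options.
# Neither the function name nor the structure reflects the real db_sync.py code.

DEFAULT_COPY_OPTIONS = [
    "DELIMITER ',' REMOVEQUOTES ESCAPE",
    "BLANKSASNULL TIMEFORMAT 'auto'",
    "COMPUPDATE OFF STATUPDATE OFF",
]

def build_copy_sql(table, s3_path, credentials, copy_options=None):
    """Join user-supplied copy_options (falling back to defaults) into a COPY."""
    options = " ".join(copy_options or DEFAULT_COPY_OPTIONS)
    return (
        f"COPY {table} FROM 's3://{s3_path}' "
        f"CREDENTIALS '{credentials}' {options}"
    )

# With copy_options set in the config, the custom options replace the defaults:
print(build_copy_sql(
    "pipelinewise_crm.activity",
    "my-bucket/redshift-imports/activity.csv.gz",
    "aws_access_key_id=...;aws_secret_access_key=...",
    copy_options=["CSV", "EMPTYASNULL", "TRUNCATECOLUMNS", "GZIP"],
))
```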

Describe alternatives you've considered

We could fork the repo and manually modify db_sync.py, but this seems counterproductive.

We could 'fix' the stl_load_errors in the source data, but this is not always an available option.

Additional context

/target_redshift/db_sync.py, ~ line 372. When loading data into Redshift using the COPY statement, there are a number of data format and conversion parameters that can be used. Currently, pipelinewise-target-redshift uses the following:

DELIMITER ',' REMOVEQUOTES ESCAPE
BLANKSASNULL TIMEFORMAT 'auto'
COMPUPDATE OFF STATUPDATE OFF

This is limiting if your source data requires a different set of conversion parameters in order to execute the COPY statement properly. For example, our dataset throws an stl_load_error with the pipelinewise COPY statement, but the following options have worked for us for years. We have tested them against the CSV generated by pipelinewise, and the data loads properly.

CSV
EMPTYASNULL
TRIMBLANKS
FILLRECORD
TRUNCATECOLUMNS
GZIP

Different projects may require different configurations from time to time; it would be nice to provide the flexibility to customize the COPY statement parameters easily.

Who will this benefit?

This would benefit every project that uses the Redshift target. Assuming the default configuration works for most projects, it would require no additional configuration initially. When you do end up with a project that requires further fine-tuning, the option is readily available.

koszti commented 5 years ago

Thanks for the details; I'd treat this as a bug.

Since target-redshift converts the standard singer JSON messages to the internal CSV files, every valid singer JSON message should load automatically into Redshift without manual fine-tuning of the COPY command. If adjustments are required, the COPY command should be generated automatically at run time.

Your request to add copy_options still makes sense, but I would first try to fix this without adding that feature.

Can you please share an example singer JSON (SCHEMA and RECORD singer messages) that produces the stl_load_error?

mlavoie-sm360 commented 5 years ago

Here is the stl_load_error

userid: 100
slice: 0
tbl: 21418980
starttime: 2019-09-10 19:15:45.400096
session: 4959
query: 10164782
filename: s3://***/pipelinewise-redshift-imports/pipelinewise_crm.Activity_20190910-191524.csv.gz
line_number: 63596
colname: employeeid
type: numeric
col_length: 18, 0
position: 95
raw_line: 70127,3,2011-12-07 14:43:28,2011-12-07 14:43:28,called no answer and no answering machine\\,135,2011-12-07 14:43:28,5,8574,7,3,135,-1,4,0,9924,,0,,,,,,2019-09-10 17:50:57,2019-09-10 17:50:57,
raw_field_value: 2011-12-07 14:43:28
err_code: 1207
err_reason: Invalid digit, Value '-', Pos 4, Type: Decimal

I'm not sure how I would get the singer JSON messages.

koszti commented 5 years ago

Based on the filename pattern pipelinewise_crm.Activity_20190910-191524.csv.gz, it looks like your CSV was not generated by pipelinewise-target-redshift but by one of the built-in FastSync components. FastSync is used to do quick initial syncs from source to target whenever possible. More info about FastSync is at https://transferwise.github.io/pipelinewise/concept/fastsync.html

Can you please confirm whether your source DB is MySQL or Postgres? If so, please tell me which of the two and I will try to reproduce the issue. I have some ideas about what's going on.

Every component should generate the CSV files in the same format, and every component should use the same compatible COPY command; in that case we don't need to specify custom copy_options. We need to modify one or more of these places:

mlavoie-sm360 commented 5 years ago

Yes, our source DB is MySQL.

koszti commented 5 years ago

The issue causing the failed load has been fixed by https://github.com/transferwise/pipelinewise/commit/9905bff09d929629ac2ceced642628542e848c31 in the fastsync functionality and merged into the main https://github.com/transferwise/pipelinewise repository.

Please upgrade to the latest (0.10.2) PipelineWise and it should work.

Special characters were not handled correctly, and the \ character in your data failed to load. Adding custom settings to the COPY command is not implemented by this fix, because the built-in settings should work on every dataset without any manual fine-tuning.
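For illustration, the class of problem the fix addresses can be sketched like this (a simplified example, not the actual fastsync code): with COPY ... REMOVEQUOTES ESCAPE, a backslash in the data must itself be escaped before writing the CSV, otherwise a literal \ in a field (as in "called no answer and no answering machine\") makes Redshift misparse the rest of the row.

```python
# Simplified illustration of the problem the fix addresses; not the actual
# fastsync code. With COPY ... ESCAPE, a backslash in the data must be
# doubled in the CSV, or Redshift treats it as an escape character and
# misparses the remaining columns.

def escape_field(value: str) -> str:
    """Escape backslashes so COPY ... ESCAPE reads the field back verbatim."""
    return value.replace("\\", "\\\\")

row = ["70127", "called no answer and no answering machine\\", "135"]
csv_line = ",".join(escape_field(field) for field in row)
print(csv_line)  # the lone trailing backslash is now doubled
```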

mlavoie-sm360 commented 5 years ago

Awesome, thank you @koszti. We just ran some tests and got another error. Although this one is not linked to special characters, I feel it may still be related; perhaps not. Let me know if you think I should open a separate issue on this:

colname: description
type: varchar
col_length: 65535
err_code: 1204
err_reason: String length exceeds DDL length

The row/column giving this error is ~71K characters long and is of type MEDIUMTEXT in the source table.

A TRUNCATECOLUMNS on load would fix this.
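For context, Redshift's VARCHAR(65535) limit is measured in bytes, so a ~71K-character MEDIUMTEXT value cannot fit; TRUNCATECOLUMNS tells COPY to cut the value at the column width instead of failing the load. A rough, illustrative sketch of the equivalent client-side truncation:

```python
# Illustrative only: emulate what TRUNCATECOLUMNS does for a VARCHAR(65535)
# column. Note the Redshift limit is in bytes, not characters.

REDSHIFT_VARCHAR_MAX_BYTES = 65535

def truncate_for_varchar(value: str, max_bytes: int = REDSHIFT_VARCHAR_MAX_BYTES) -> str:
    """Cut a string so its UTF-8 encoding fits in max_bytes, keeping valid UTF-8."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Slicing bytes may split a multi-byte character; errors="ignore" drops it.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

description = "x" * 71000  # ~71K characters, as in the failing MEDIUMTEXT row
print(len(truncate_for_varchar(description)))  # 65535
```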

koszti commented 5 years ago

Thanks for the feedback. I think I know what this is. I'll reproduce it on my end and will send some updates soon.

koszti commented 5 years ago

This has been fixed by https://github.com/transferwise/pipelinewise-target-redshift/commit/3ccf2df73e03be5bedeea265b11681a1a233af41 and included in the 1.0.7 PyPI package.

Also, it has been bumped in the main PipelineWise repo. Example YAML is in the doc at https://transferwise.github.io/pipelinewise/connectors/targets/redshift.html

TRUNCATECOLUMNS is now included in the default COPY command as well, so you may not need to specify it explicitly.