mlavoie-sm360 closed this issue 5 years ago.
Thanks for the details, I'd take it as a bug.
Since target-redshift converts the standard Singer JSON messages to internal CSV files, every valid Singer JSON message should load into Redshift automatically, without manual fine-tuning of the COPY command; where required, the COPY command should be generated automatically at run time.
Your request to add `copy_options` still makes sense, but I would first try to fix this without adding that feature in the first go.
Can you please share an example Singer JSON (a SCHEMA and a RECORD message) that produces the `stl_load_error`?
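For reference, a minimal pair of Singer SCHEMA and RECORD messages looks like the sketch below. The stream and field names are illustrative only (loosely modelled on the error log later in this thread), not taken from the reporter's actual tap:

```python
import json

# Minimal Singer SCHEMA and RECORD messages for a hypothetical "activity" stream.
schema_message = {
    "type": "SCHEMA",
    "stream": "activity",
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "employeeid": {"type": ["null", "number"]},
            "created_at": {"type": "string", "format": "date-time"},
        },
    },
    "key_properties": ["id"],
}
record_message = {
    "type": "RECORD",
    "stream": "activity",
    "record": {"id": 70127, "employeeid": 135, "created_at": "2011-12-07T14:43:28Z"},
}

# A tap emits one JSON message per line on stdout; the target reads them back in.
for msg in (schema_message, record_message):
    print(json.dumps(msg))
```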
Here is the `stl_load_error`:
```
userid,slice,tbl,starttime,session,query,filename,line_number,colname,type,col_length,position,raw_line,raw_field_value,err_code,err_reason
"100","0","21418980","2019-09-10 19:15:45.400096","4959","10164782","s3://***/pipelinewise-redshift-imports/pipelinewise_crm.Activity_20190910-191524.csv.gz ","63596","employeeid ","numeric ","18, 0 ","95","70127,3,2011-12-07 14:43:28,2011-12-07 14:43:28,called no answer and no answering machine\\,135,2011-12-07 14:43:28,5,8574,7,3,135,-1,4,0,9924,,0,,,,,,2019-09-10 17:50:57,2019-09-10 17:50:57, ","2011-12-07 14:43:28 ","1207","Invalid digit, Value '-', Pos 4, Type: Decimal "
```
Not sure how I would get the singer JSON messages
Based on the filename pattern `pipelinewise_crm.Activity_20190910-191524.csv.gz`, it looks like your CSV was not generated by pipelinewise-target-redshift but by one of the built-in FastSync components. FastSync is used to do quick initial syncs from source to target whenever possible. More info about FastSync at https://transferwise.github.io/pipelinewise/concept/fastsync.html
Can you please confirm whether your source db is MySQL or PostgreSQL? If so, please tell me which of the two and I will try to reproduce the issue. I have some ideas about what's going on.
Every component should generate CSV files in the same format, and every component should use the same compatible COPY command; in that case we don't need to specify custom `copy_options`. We need to modify one or more of these places:
Yes, our source db is mysql.
The issue causing the failed load has been fixed by https://github.com/transferwise/pipelinewise/commit/9905bff09d929629ac2ceced642628542e848c31 in the FastSync functionality and merged into the main https://github.com/transferwise/pipelinewise repository.
Please upgrade to the latest (0.10.2) PipelineWise and it should work.
Special characters were not handled correctly, and the `\` char in your data failed to load. Custom settings for the COPY command are not part of this fix, because the built-in settings should work on any data without manual fine-tuning.
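To make the failure mode concrete: a lone backslash in a field can be consumed by COPY's `ESCAPE` handling, shifting column boundaries so a timestamp lands in a numeric column, as in the error above. A sketch of the kind of escaping the target has to apply (illustrative only, not the actual commit):

```python
def escape_for_copy(value: str) -> str:
    """Double any backslash so COPY's ESCAPE option reads it as a literal
    character instead of consuming the next byte. Illustrative sketch of the
    kind of fix applied, not the code from the linked commit."""
    return value.replace("\\", "\\\\")

# Trailing backslash like the one in the failing row:
field = "called no answer and no answering machine\\"
print(escape_for_copy(field))  # ends in a doubled backslash
```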
Awesome, thank you @koszti. We just ran some tests and got another error. This one is not linked to special characters, but I feel it is still related; perhaps not. Let me know if you think I should open a separate issue on this:
```
colname: description
type: varchar
col_length: 65535
err_code: 1204
err_reason: String length exceeds DDL length
```
The row/column giving this error is ~71K characters long and is of type `MEDIUMTEXT` in the source table. A `TRUNCATECOLUMNS` on load would fix this.
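For illustration, this is roughly where `TRUNCATECOLUMNS` would sit in the generated statement. The table, bucket, and IAM role below are placeholders; the option names themselves are standard Redshift COPY syntax:

```python
# Sketch of a COPY command including TRUNCATECOLUMNS (placeholders throughout).
COPY_SQL = (
    "COPY pipelinewise_crm.activity "
    "FROM 's3://my-bucket/pipelinewise_crm.Activity.csv.gz' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load' "
    "DELIMITER ',' REMOVEQUOTES ESCAPE GZIP "
    # Silently truncate VARCHAR data that exceeds the column's DDL length
    # instead of raising err_code 1204:
    "TRUNCATECOLUMNS"
)
print(COPY_SQL)
```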
Thanks for the feedback. I think I know what this is. I'll reproduce it on my end and will send some updates soon.
This has been fixed by https://github.com/transferwise/pipelinewise-target-redshift/commit/3ccf2df73e03be5bedeea265b11681a1a233af41 and included in the 1.0.7 PyPI package.
Also, it has been bumped in the main PipelineWise repo. Example YAML is in the doc at https://transferwise.github.io/pipelinewise/connectors/targets/redshift.html
Also, `TRUNCATECOLUMNS` is now included in the default COPY command, so you may not need to specify it explicitly.
Describe the feature
Provide a way to customize the data format and conversion parameters of the `COPY` statement in the config file.

Describe alternatives you've considered
We could fork the repo and manually modify `db_sync.py`, but this seems counterproductive. We could 'fix' the `stl_load_errors` in the source data, but this is not always an available option.

Additional context

`/target_redshift/db_sync.py`, ~ line 372
When loading data into Redshift using the `COPY` statement, there is a series of data format and conversion parameters that can be used. Currently, pipelinewise-target-redshift uses the following. This is limiting if your source data requires a different set of conversion parameters in order to execute the `COPY` statement properly. For example, our dataset throws a `stl_load_error` for the pipelinewise `COPY` statement, but a set of options we have used for years now works fine; we have tested them against the CSV generated by pipelinewise and the data loads properly. Different projects may require different configurations from time to time, so it would be nice to provide the flexibility to customize the `COPY` statement parameters easily.

Who will this benefit?
This would benefit every project that uses the Redshift target. Assuming the default configuration works for most projects, it would require no additional configuration initially. When you do get into a project that requires further fine-tuning, the option is readily available.
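To make the request concrete, a hypothetical target config is sketched below. The `copy_options` key is the proposed feature, not an existing setting in the released target, and all connection values are placeholders:

```yaml
# Hypothetical pipelinewise-target-redshift config sketch.
target_id: redshift
type: target-redshift
db_conn:
  host: my-cluster.example.redshift.amazonaws.com
  port: 5439
  dbname: analytics
  # Proposed (does not exist today): appended verbatim to the generated COPY
  copy_options: "DELIMITER ',' REMOVEQUOTES ESCAPE GZIP TRUNCATECOLUMNS"
```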