practo / tipoca-stream

Near real time cloud native data pipeline in AWS (CDC+Sink). Hosts code for RedshiftSink. RDS to RedshiftSink Pipeline with masking and reloading support.
https://towardsdatascience.com/open-sourcing-tipoca-stream-f261cdcc3a13
Apache License 2.0
47 stars 5 forks source link

Schema migration fails due to column order issue due to extra columns #244

Closed alok87 closed 3 years ago

alok87 commented 3 years ago

Masking has a feature which allows to add new extra columns in Redshift tables. These columns are only present in Redshift and not at source, as they are computed columns required for aggregate analysis on sensitive columns.

One such feature is length_keys

Recently in one of such table using that feature we saw the new columns were added in source, the change could not be applied in Redshift. Following errors occured was seen in the loader logs:

[sarama] 2021/07/03 12:41:50 kafka: error while consuming loader-09a9e4-ts.customers.inventory/0: Schema migration failed, err: 1 error occurred:
    * table: customers mismatch col: new_col_name, prop: Name, input: new_col_name, target: text_length

(actual table names have been changed above)^ Conflict happens as the source has new_col_name coming before. But the target has an extra columntext_length at that position.

I remember I had solved for this use case https://github.com/practo/tipoca-stream/issues/126. But I see it has happened again. Need to relook and fix the problem.

Solution proposed

Need to have extra columns as first columns in the table instead of they being the last columns. When this starts happening we won't get order issues when new columns are added to a table at source. It would also work when a new extra column is added because that would be a recreate of table due to reload.

alok87 commented 3 years ago

One issue found

https://github.com/practo/tipoca-stream/blob/387734a193fb69d70acbe1a530aacb23901631af/pkg/redshift/redshift.go#L1103-L1104

I0703 13:36:46.282035    4794 redshift.go:1105] Missing column: new_col_name, add column will run.
I0703 13:36:46.282056    4794 redshift.go:1105] Missing column: text_length, add column will run.

ALTER TABLE ADD COLUMNS is being constructed by the migration for the extra columns which are not present in source. This is because it only checks if the column needs to be added based on the index number. It should ignore extra columns instead.

alok87 commented 3 years ago

Full table reload is required in this case. Also the full table reload should happen as reload group reload.

alok87 commented 3 years ago

Solution should consider keeping tables with extra columns as first columns so that there is no migration problem due to new columns being added at source.

And if any new extra column is added in masking, it will any way recreate the table in reload group, which already works!