transferwise / pipelinewise-target-redshift

Singer.io Target for Amazon Redshift - PipelineWise compatible
https://transferwise.github.io/pipelinewise/
Other
12 stars 65 forks source link

Allow suffixing columns with varying schema by type without timestamp #24

Open Limess opened 5 years ago

Limess commented 5 years ago

Currently we ingest a number of relatively loosely typed events using this target, intending to coalesce the types appropriately in the data warehouse layer, e.g using coalesce-fields from DBT (https://github.com/fishtown-analytics/stitch-utils).

This is possible when using target-redshift or an ETL service using it such as Stitch, however it doesn't work with this tap as the number of columns grows infinitely if the type of the column keeps changing.

For example, in our data warehouse raw tables which are imported using this target, we currently see:

properties__entity__sentiment__compound_20191002_1139
properties__entity__sentiment__neg_20191002_1139
properties__entity__sentiment__neu_20191002_1139
properties__entity__sentiment__pos_20191002_1139
properties__entity__sentiment__compound_20191002_1219
properties__entity__sentiment__neg_20191002_1219
properties__entity__sentiment__neu_20191002_1219
properties__entity__sentiment__pos_20191002_1227
properties__entity__sentiment__compound_20191002_1237
properties__entity__sentiment__neg_20191002_1237
properties__entity__sentiment__neu_20191002_1237
properties__entity__sentiment__pos_20191002_1237
properties__entity__sentiment__compound_20191002_1254
properties__entity__sentiment__neg_20191002_1254
properties__entity__sentiment__neu_20191002_1254
properties__entity__sentiment__pos_20191002_1254
properties__entity__sentiment__compound_20191002_1333
properties__entity__sentiment__neg_20191002_1310
properties__entity__sentiment__neu_20191002_1333
properties__entity__sentiment__pos_20191002_1333
properties__entity__sentiment__neg_20191002_1326
properties__entity__sentiment__neg_20191002_1333
properties__entity__sentiment__compound_20191002_1342
properties__entity__sentiment__neg_20191002_1350
properties__entity__sentiment__neu_20191002_1342
properties__entity__sentiment__pos_20191002_1342
properties__entity__sentiment__compound_20191002_1445
properties__entity__sentiment__neu_20191002_1445
properties__entity__sentiment__pos_20191002_1445
properties__entity__sentiment__neg_20191002_1359
properties__entity__sentiment__neg_20191002_1429
properties__entity__sentiment__neg_20191002_1445
properties__entity__sentiment__compound_20191002_1454
properties__entity__sentiment__neg_20191002_1501
properties__entity__sentiment__neu_20191002_1454
properties__entity__sentiment__pos_20191002_1454
properties__entity__sentiment__compound
properties__entity__sentiment__neu
properties__entity__sentiment__pos_20191005_0102
properties__entity__sentiment__neg_20191003_0102
properties__entity__sentiment__neg_20191004_0103
properties__entity__sentiment__neg_20191007_0101
properties__entity__sentiment__pos_20191006_0102
properties__entity__sentiment__pos
properties__entity__sentiment__neg

due to the source data flip-flopping between types. We're aware it would be beneficial to edit the data on the source, however our aim is to coalesce these columns downstream, rather than in our loading process, and fixing it in the source wouldn't prevent new problems arising in new events in future.

We'd benefit here from the approach of target-redshift of not versioning the columns, instead just adding the column datatype as a suffix.

Related to #19 as defining a type (varchar or string normally in these cases) would offer an alternative.