spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0
135 stars 62 forks source link

Preserve table permissions on SaveMode.overwrite #126

Open lakshman-kollipara opened 1 year ago

lakshman-kollipara commented 1 year ago

Hi

We have a use case of overwriting data in existing Redshift table. Using "SaveMode.Overwrite" is dropping table metadata and hence losing all the table permissions. We cannot really use the workaround of truncating the table in preActions and populate the table because the current implementation runs each query in preActions and postActions in separate individual transactions by setting auto-commit to true.

Do we have any alternative to preserve table permissions when using "SaveMode.Overwrite" or a way to run preActions and postActions within a single transaction block?

jsleight commented 1 year ago

Can you write the data into a staging table, then merge the staging table into your main table?

We have auto-commit as false though each of the pre-/post-Actions is executed serially via jdbc.

SaveMode.Overwrite drops the table if it exists, so makes sense that it would also drop the metadata. Saving the metadata before dropping it would be a nice improvement -- though that is kind of the staging table approach I guess?

lakshman-kollipara commented 1 year ago

The problem with staging table is same as mentioned earlier. If postActions are executing in individual transaction blocks, It could cause dirty-reads because Truncate happens in one transaction and loading from stage table happens in another transaction. And usually the INSERT into SELECT queries take significant amount of time for medium-to-huge datasets.

Currently, The autocommit is set to false before doRedshiftLoad and we are committing if the method didn't return any errors. Does that mean All the preActions, Drop table, createTable and postActions are running within a single transaction block?

jsleight commented 1 year ago

Caveat that I'm not being too familiar with the RedshiftWriter and the difference between the jdbc query execution and the transaction blocks is messing with me.

But from what I understand, we only do the commit once at the end -- so it should it be a single transaction block.