[WIP] Provide spark catalog, dsv2 and use parquet for copy/unload

spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark

Apache License 2.0


Closed parisni closed 1 year ago

parisni commented 1 year ago

This PR:

- Merges #70 (datasource v2) into master; fixes #119.
- Adds a Spark catalog for reading, writing, and running DDL against Redshift from Spark SQL (see readme.md); fixes #118.
- Adds a per-table cache on S3 with a TTL (for analytics use cases); fixes #114.
- Handles empty Parquet output when there are no rows in Redshift; fixes #116.
- Provides COPY from Parquet; fixes #117.
- Supports Redshift column comments and faster table schema discovery.
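To illustrate the per-table TTL cache idea, here is a minimal sketch of the freshness check only. It is not the code from this PR: the function name `is_cache_fresh` and the idea of storing an "unloaded at" timestamp per table are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_cache_fresh(unloaded_at: Optional[datetime],
                   ttl: timedelta,
                   now: Optional[datetime] = None) -> bool:
    """Return True if a cached unload of a table is still within its TTL.

    unloaded_at: when the table was last unloaded to S3 (hypothetical
                 metadata kept next to the cached files), or None if the
                 table has never been unloaded.
    ttl:         how long a cached unload may be reused.
    now:         injectable clock for testing; defaults to the current UTC time.
    """
    if unloaded_at is None:
        # Nothing cached yet, so a fresh UNLOAD is needed.
        return False
    current = now if now is not None else datetime.now(timezone.utc)
    age = current - unloaded_at
    # A negative age (timestamp in the future) is treated as stale.
    return timedelta(0) <= age < ttl
```

With a one-hour TTL, a table unloaded five minutes ago would be served from S3, while one unloaded two hours ago would trigger a new unload.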

parisni commented 1 year ago

Found two issues with this:

- Spark's parallelism when reading the Parquet files equals the number of files, which makes reads after the unload perform poorly. It would be better to read the unload folder directly and skip the manifest handling.
- When the query is cancelled on the Redshift side, no error is raised and the library returns a dataframe with whatever content had been unloaded so far (which is incomplete).
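One possible guard against the silent partial result is to compare the number of rows actually read with the row count recorded by the unload. The sketch below is not part of this PR. It assumes a manifest shaped like the one Redshift writes for `UNLOAD ... MANIFEST VERBOSE` (a `record_count` under `meta`, per entry and optionally at the top level), and it assumes that a cancelled query leaves either no manifest or a mismatching count, which is not verified here. The names `IncompleteUnloadError`, `expected_record_count`, and `assert_unload_complete` are hypothetical.

```python
from typing import Any, Mapping, Optional


class IncompleteUnloadError(RuntimeError):
    """Raised when the data read back does not match what the unload reported."""


def expected_record_count(manifest: Mapping[str, Any]) -> int:
    """Return the row count recorded in an already-parsed manifest.

    Prefers the top-level meta.record_count; otherwise sums the per-file
    counts. A KeyError is raised if an entry carries no record_count.
    """
    meta = manifest.get("meta") or {}
    if "record_count" in meta:
        return int(meta["record_count"])
    entries = manifest.get("entries") or []
    return sum(int(entry["meta"]["record_count"]) for entry in entries)


def assert_unload_complete(manifest: Optional[Mapping[str, Any]],
                           rows_read: int) -> None:
    """Fail loudly instead of returning a partial dataframe."""
    if manifest is None:
        raise IncompleteUnloadError(
            "no manifest found; the UNLOAD may not have finished")
    expected = expected_record_count(manifest)
    if rows_read != expected:
        raise IncompleteUnloadError(
            f"read {rows_read} rows but the manifest reports {expected}")
```

This keeps the manifest out of the read path (the files can still be read from the unload folder for better parallelism) and uses it only as a completeness check.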

smoy commented 1 year ago

This creates many more conflicts in this pull request. If this PR is still wanted, it is probably better to open a new one instead.

In addition, the AWS contribution has brought along many improvements, including some of the features this PR originally intended to provide. See https://github.com/spark-redshift-community/spark-redshift/pull/128