spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0
137 stars 63 forks

[WIP] Provide spark catalog, dsv2 and use parquet for copy/unload #120

Closed parisni closed 1 year ago

parisni commented 1 year ago

This PR:

  1. Merges #70 (datasource v2) onto master. Fixes #119.
  2. Adds a Spark catalog feature for reading, writing, and running DDL against Redshift from Spark SQL (see readme.md). Fixes #118.
  3. Adds a cache with a TTL on S3 for each table (for analytics use cases). Fixes #114.
  4. Fixes empty Parquet files when there are no rows in Redshift. Fixes #116.
  5. Provides COPY from Parquet. Fixes #117.
  6. Supports Redshift column comments and faster table schema discovery.

parisni commented 1 year ago

Found 2 issues with this:

smoy commented 1 year ago

Because sensitive material was recently introduced into the repository, I had to rewrite history using the procedure here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository

This creates a lot more conflicts in this pull request. If this PR is still wanted, it is probably best to open a new one instead.

In addition, the AWS contribution has brought many improvements, including some of the features this PR originally intended to add. See https://github.com/spark-redshift-community/spark-redshift/pull/128