spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

[WIP] Issue #69: Support for DSV2 #70

Closed karuppayya closed 1 year ago

karuppayya commented 4 years ago

What's new

  1. Uses the DSV2 (DataSource V2) APIs of Spark 3.0
  2. Leverages Redshift's ability to unload data in Parquet format
    • Avoids the row-wise type conversion done for CSV. This should give a good performance improvement (benchmark yet to be done)
    • Columnar reads
  3. Uses Spark's own DSV2 CSV and Parquet sources for reads (no need for a custom reader)

Note: this PR adds only read support, since DSV2 write support is broken
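
For readers unfamiliar with the Parquet unload mentioned in point 2, here is a minimal sketch of the kind of Redshift UNLOAD statement involved. It is not the code in this PR: the helper name `build_unload_parquet`, the bucket, the table, and the IAM role ARN are all hypothetical, and it assumes IAM-role authorization.

```python
def build_unload_parquet(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that writes query results as Parquet.

    Hypothetical helper for illustration only; not part of spark-redshift.
    """
    # UNLOAD takes the query as a single-quoted string, so any single
    # quote inside the query is doubled to keep the literal intact.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )


if __name__ == "__main__":
    # All values below are made-up placeholders.
    statement = build_unload_parquet(
        "SELECT id, name FROM events WHERE region = 'eu'",
        "s3://example-bucket/tmp/unload_",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    )
    print(statement)
```

Once Redshift has written the files under that prefix, they could be loaded with Spark's built-in Parquet source (for example `spark.read.parquet(...)` on the same path), which is what makes a custom reader unnecessary.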

smoy commented 4 years ago

I will start taking a look this week or next week. My organization hasn't prepped for Spark 3.0, so it's in a lot of flux. I am setting up my personal development machine to check this out.

karuppayya commented 4 years ago

@smoy @lucagiovagnoli I have fixed the compilation errors and run the tests. Given that 3.0.0 is out, can you please review the changes when you get some cycles? We can also come up with a plan to make sure the code is compatible with both 2.x and 3.x (with my changes the repo is currently not compatible with both; we can fix that through the review process). Thanks

gfeldman commented 3 years ago

Anything I can do to help?

karuppayya commented 3 years ago

Thanks @gfeldman. Any help with the review is much appreciated.

smoy commented 1 year ago

Closing since this has been inactive for some time.