hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License

Snappy Compression Support #38

Open jcannelos opened 6 years ago

jcannelos commented 6 years ago

Hi all, this is a great product and has saved us a ton of time on our Redshift-to-Spectrum transition. Question: is it possible to write the Parquet files with Snappy compression rather than gzip? I can see in Writer._get_writer where gzip is specified. Do I have to subclass Writer, then CsvManifestConverter and ConcurrentManifestConverter, in order to specify Snappy, or is there a simpler way?

Thanks!

Sincerely,

J'son

c-nichols commented 6 years ago

Hi J'son :)

Happy you've found Spectrify useful. Regarding your question -- I think that's the easiest way right now... if it's any comfort, it used to be significantly more difficult, so at least it's trending in the right direction!

The default is gzip because in the benchmarks I performed:

Either/both of those may have been artifacts of our configuration, or may have changed since those tests (last October).

Ways forward:

Thanks, Colin
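
For readers following along, the subclassing approach discussed in this thread can be sketched roughly as below. This is only an illustration of the override pattern: the classes are stand-ins that borrow the names mentioned above (Writer, _get_writer, CsvManifestConverter), not Spectrify's real implementations. The dict returned by _get_writer, the writer_cls attribute, and the convert method are all assumptions made for the example, so check the actual source before relying on any of them.

```python
# Stand-in classes only: they mirror names from this thread, NOT Spectrify's
# real code. The dict "writer", `writer_cls`, and `convert` are hypothetical.


class Writer:
    """Stand-in for a writer whose _get_writer hard-codes gzip."""

    def _get_writer(self, path):
        # A real implementation would build a Parquet writer here; this
        # sketch just records the settings it would use.
        return {"path": path, "compression": "gzip"}


class SnappyWriter(Writer):
    """Subclass that swaps the hard-coded codec for snappy."""

    def _get_writer(self, path):
        settings = super()._get_writer(path)
        settings["compression"] = "snappy"
        return settings


class CsvManifestConverter:
    """Stand-in converter that creates its own writer."""

    writer_cls = Writer  # hypothetical hook for choosing the writer class

    def convert(self, path):
        return self.writer_cls()._get_writer(path)


class SnappyCsvManifestConverter(CsvManifestConverter):
    """Converter subclass that only changes which writer is created."""

    writer_cls = SnappyWriter


if __name__ == "__main__":
    print(CsvManifestConverter().convert("part-0.parq"))
    print(SnappyCsvManifestConverter().convert("part-0.parq"))
```

A ConcurrentManifestConverter subclass would presumably follow the same shape, pointing at the Snappy-enabled converter instead of the default one.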