hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License
116 stars 25 forks

Parquet conversion #56

Closed BhuviTheDataGuy closed 4 years ago

BhuviTheDataGuy commented 5 years ago

Hey, this tool is amazing and simplifies a data engineer's life.

I'm trying to understand the principles behind this tool.

It converts the CSV to Parquet. I'm just curious how it does this without any Hadoop clusters?

c-nichols commented 4 years ago

Hi Bhuvi,

Spectrify uses the Apache Arrow project to write Parquet files. Behind the scenes, Arrow uses the Apache-managed C++ Parquet writer, parquet-cpp, so no Hadoop cluster is needed.

More info here: https://arrow.apache.org/docs/python/parquet.html