hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License

Using a single Spectrum table #45

Closed sjkelleyjr closed 5 years ago

sjkelleyjr commented 5 years ago

I have a Redshift cluster that creates a new table every day. I'm concerned about Redshift's table count limit, and I'd like to archive old tables to stay under it. What I'd like to do is create one external Spectrum table through which I can query all of this archived data.

For example, let's say I have tables A, B, C, D, E created sequentially in time (A on day 1, B on day 2, and so on). I'd like to archive, say, A, B, and C to S3 and use one unified Spectrum table to query the data that was in those three tables.

Is this feasible? Is it feasible with spectrify? If so, is there code I can contribute to make this happen?

c-nichols commented 5 years ago

Yes, this is feasible, and spectrify can help. For each table, run the "export" command followed by the "convert" command; the result will be each table's data in Parquet format, in a separate folder per table.
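As a rough sketch, the export/convert pass over three archived tables might look like this. The cluster host, credentials, table names, and bucket paths are all placeholders, and the exact subcommand arguments are assumptions based on spectrify's CLI, so check `spectrify --help` before running:

```shell
# Sketch only: generate the spectrify export + convert invocations for each
# archived daily table. Host/user/db/bucket values are placeholders.
CONN="--host=mycluster.redshift.amazonaws.com --user=myuser --db=mydb"
CMDS=""
for TABLE in table_a table_b table_c; do
  CMDS="$CMDS
spectrify $CONN export $TABLE s3://my-bucket/archive/$TABLE/
spectrify $CONN convert $TABLE s3://my-bucket/archive/$TABLE/"
done
echo "$CMDS"    # review the commands, then run them (or pipe to sh)
```

Keeping each table's Parquet output in its own S3 prefix is what lets you attach the folders as partitions later.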

Then you'd want to run the "create_table" command, passing it one of the Redshift tables and its corresponding Parquet folder. This will give you a Spectrum table for just that daily table.
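That step might look something like the following; the argument order and the schema/table names here are assumptions, not spectrify's documented interface, so confirm against `spectrify --help`:

```shell
# Sketch: create the Spectrum table from one day's exported Parquet data.
# The create_table argument order is an assumption; names/paths are placeholders.
CONN="--host=mycluster.redshift.amazonaws.com --user=myuser --db=mydb"
CMD="spectrify $CONN create_table s3://my-bucket/archive/table_a/ table_a spectrum_schema.daily_archive"
echo "$CMD"    # review, then run
```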

The last step is to add the rest of the daily Parquet data as partitions. This isn't currently built into spectrify, but it's easy to extend the tool to do so. I think there is another issue about this; I'll link it if I can find it (on mobile).
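For the partition step, plain Redshift DDL works. Assuming the unified external table was created partitioned by a `day` column (schema, table, and S3 paths below are placeholders), you'd add one partition per archived folder:

```shell
# Sketch: emit one Redshift Spectrum ADD PARTITION statement per archived
# daily folder. Schema/table/column names and S3 paths are placeholders.
SQL=""
for DAY in 2019-01-01 2019-01-02 2019-01-03; do
  SQL="$SQL
ALTER TABLE spectrum_schema.daily_archive ADD PARTITION (day='$DAY')
LOCATION 's3://my-bucket/archive/$DAY/';"
done
echo "$SQL"    # run these against Redshift with your usual SQL client
```

Each partition just points the table at one day's S3 prefix, so querying the unified table scans all of the archived days.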

Does that help?

colin

c-nichols commented 5 years ago

@sjkelleyjr see here https://github.com/hellonarrativ/spectrify/issues/7#issuecomment-404653534

sjkelleyjr commented 5 years ago

I think this works. I'll let you know if I have any more questions. Thanks Colin!