hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License
116 stars 25 forks source link

Add support for REAL data type. #41

Closed eliran-lightricks closed 5 years ago

eliran-lightricks commented 5 years ago

Trying to solve #37

andrewgross commented 5 years ago

Just a heads up, Redshift considers REAL to be FLOAT4 (or float32() in SQLAlchemy). I think if you use float64() you will have issues reading the data with Spectrum.

c-nichols commented 5 years ago

@eliran-lightricks from reading the documentation I think @andrewgross is correct, REAL columns should use type pa.float32().

Relevant documentation below -- note that REALs consume 4 bytes of space, indicating pa.float32() is sufficient.

https://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html

Would you mind testing with pa.float32() and see how it goes?

eliran-lightricks commented 5 years ago

I did some tests and it seems to work with uploading to S3 and copying back. But for some reason the tests are failing, trying to figure out exactly where.

c-nichols commented 5 years ago

@eliran-lightricks try using self.assertAlmostEqual(self.data, written_data, places=5) I think it's just the floating point precision causing issues with strict comparison

eliran-lightricks commented 5 years ago

@c-nichols Had to do some alterations since self.assertAlmostEqual can't work directly on lists. Let me know if that's okay with you.

anomitra commented 5 years ago

@c-nichols Did you release this on PyPI? I pip install'd this today, and this error cropped up -- got fixed when I applied the diff from the PR.

c-nichols commented 5 years ago

Releasing is still on my todo list :see_no_evil: I will do so within the next couple days

anomitra commented 5 years ago

Please do! Don't make me monkeypatch this on production 😂

c-nichols commented 5 years ago

@anomitra new release is up on pypi. "a few" == 18 apparently