hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License

Timestamp Overruns #8

Open andrewgross opened 6 years ago

andrewgross commented 6 years ago

In my work on json2parquet I ran into some issues with far-future timestamps. If a timestamp exceeds 2262-04-11 23:47:16.854775807 it overflows the timestamp[ns] type in Pandas, which makes it impossible to get the data into a form PyArrow can consume to produce the deprecated int96 format used by Spark/Presto/Impala.

Not sure what the best solution is in this case. I did some work here to replace any value past that date with Timestamp.max, but that is definitely a tradeoff. Maybe there is a better way to convert timestamps so that the int96 converter can still be used accurately.
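
For reference, a rough sketch of that clamping approach, assuming the Redshift export arrives as strings in a pandas column (the column name and framing are hypothetical, not spectrify's or json2parquet's actual code):

```python
import pandas as pd

def clamp_far_future(series: pd.Series) -> pd.Series:
    # errors="coerce" turns values outside the ns-precision range
    # (anything past 2262-04-11) into NaT instead of raising
    # OutOfBoundsDatetime.
    parsed = pd.to_datetime(series, errors="coerce")
    # Replace the coerced values with the largest representable timestamp.
    # Note this also clamps genuinely unparseable strings -- the tradeoff
    # mentioned above.
    return parsed.fillna(pd.Timestamp.max)

df = pd.DataFrame({"event_ts": ["2018-01-01 00:00:00", "9999-12-31 23:59:59"]})
df["event_ts"] = clamp_far_future(df["event_ts"])
```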

c-nichols commented 6 years ago

Using microsecond or millisecond precision would allow a much larger time range. I do not believe that arrow/parquet-cpp support conversion from those formats to int96; should support become available for those types/operations, I'd be happy to integrate it :)

Another option is to make the field a string; however, this makes partitioning/querying much more difficult.
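
To illustrate the range point (a standalone example, not something spectrify does today): microsecond precision in Arrow comfortably holds dates that overflow nanoseconds.

```python
from datetime import datetime
import pyarrow as pa

# timestamp("ns") only covers roughly 1677-2262; timestamp("us") spans
# roughly +/-290,000 years around the epoch, so a far-future value fits.
far_future = datetime(9999, 12, 31, 23, 59, 59)
arr = pa.array([far_future], type=pa.timestamp("us"))
print(arr)  # prints the 9999-12-31 value without overflow
```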

c-nichols commented 6 years ago

One more note -- int96 is deprecated; it seems likely that at some point Spectrum will support 64-bit timestamps directly. I have no idea what the timeline on that might be, though.

andrewgross commented 6 years ago

Agreed. I don't actually need ns resolution, but it is currently the only way to get int96 timestamps. Probably better to push on getting ms/us support for that instead.

Agreed re: int96 deprecation

c-nichols commented 6 years ago

@andrewgross int64 is now supported for timestamp columns https://forums.aws.amazon.com/thread.jspa?threadID=257362

I would like to transition Spectrify to int64 + us for the next major version -- I think that would address a larger set of use cases.
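
For anyone following along, this is roughly what writing int64 microsecond timestamps looks like with pyarrow (a sketch with made-up data, not the eventual Spectrify change):

```python
from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {"event_ts": pa.array([datetime(2030, 1, 1)], type=pa.timestamp("us"))}
)

pq.write_table(
    table,
    "events.parquet",
    coerce_timestamps="us",                 # store as int64 microseconds
    use_deprecated_int96_timestamps=False,  # skip the legacy int96 encoding
)
```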

andrewgross commented 6 years ago

Agreed, int96 is deprecated. I need to do some testing on support for tables that contain a mix of data files with int96 and int64 timestamps before I can upgrade.

asokratis commented 5 years ago

The only way I've found to handle timestamp overruns on my end is to use milliseconds instead of nanoseconds and substitute a default value whenever parsing to an ISO date fails (for example, for years under 1000 the YYYY-MM-DD output has a three-digit year with no leading zeros, which is invalid). This may be useful if you are working with a lot of dirty data, and I suspect it covers many use cases, since Redshift accepts those date ranges in its timestamp and date columns. I haven't had time to QA this, but this being my first time converting files to Parquet, I spent more than half of my time here debugging timestamps rather than implementing my cron scripts.
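
Something like the following captures that workflow, assuming string timestamps; the epoch default and column handling are illustrative only, not asokratis's actual code.

```python
import pandas as pd
import pyarrow as pa

DEFAULT_TS = pd.Timestamp("1970-01-01")  # assumed fallback for bad values

def parse_with_default(series: pd.Series) -> pa.Array:
    # Both unparseable dates (e.g. a three-digit year with no leading zero)
    # and out-of-range dates (past 2262) become NaT with errors="coerce",
    # then get the default value.
    parsed = pd.to_datetime(series, errors="coerce").fillna(DEFAULT_TS)
    # Floor to whole milliseconds and store as timestamp[ms] instead of [ns].
    return pa.array(parsed.dt.floor("ms"), type=pa.timestamp("ms"))
```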