hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License
116 stars 25 forks source link

When unloading dates before the year "1000" CSV parsing errors. #53

Open troxil opened 5 years ago

troxil commented 5 years ago

Description

Unloading a DATE column that has dates before the year 1000 causes failures.

CREATE TABLE IF NOT EXISTS public.my_dodgy_test (
  fun_date DATE
);
INSERT INTO public.my_dodgy_test VALUES ('0030-01-01');
SELECT fun_date from public.my_dodgy_test;

I know this issue is due to the UNLOAD doesn't pad the leading 0's in when writing the CSV correctly. but I'm unsure how to deal with this?

Source of the problem is in here https://github.com/hellonarrativ/spectrify/blob/master/spectrify/utils/timestamps.py#L35

What I Did

"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/myself/dev/spectrify/.venv/lib/python3.7/site-packages/spectrify/convert.py", line 215, in _parallel_wrapper
    CsvConverter(sa_table, s3_config, delimiter, escapechar, quoting, unicode_csv).convert_csv(data_path)
  File "/Users/myself/dev/spectrify/.venv/lib/python3.7/site-packages/spectrify/convert.py", line 120, in convert_csv
    for chunk in self.columnar_data_chunks(file_path, self.sa_table, SPECTRIFY_ROWS_PER_GROUP):
  File "/Users/myself/dev/spectrify/.venv/lib/python3.7/site-packages/spectrify/convert.py", line 168, in columnar_data_chunks
    value = self._convert_to_type(value, py_type)
  File "/Users/myself/dev/spectrify/.venv/lib/python3.7/site-packages/spectrify/convert.py", line 143, in _convert_to_type
    value = py_type(value)
  File "/Users/myself/dev/spectrify/.venv/lib/python3.7/site-packages/spectrify/utils/timestamps.py", line 36, in iso8601_to_days_since_epoch
    dt = ciso8601.parse_datetime(date_str)
ValueError: Invalid character while parsing year ('-', Index: 2)
"""
c-nichols commented 5 years ago

oof, this is a tough one. Does ISO-8601 specify what to do for non-4-digit years?

c-nichols commented 5 years ago

(Inadvertent button push lol)

ms32035 commented 5 years ago

I experienced the same issue with negative years (unloaded as -1), so please bear that edge case in mind too when thinking about solutions

hawktang commented 5 years ago

If you want to delete it, just follow customize section of the document to rewire query to filter out the records.

hawktang commented 5 years ago

I encounter it long time ago with even BC years