hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License

How to track down type conversion issues #47

Closed michael-erasmus closed 5 years ago

michael-erasmus commented 5 years ago

Description

Hey there! I really like this project; we're actually looking into using it to help us migrate from Redshift to BigQuery.

I am, however, running into issues when trying to convert some records from CSV to Parquet. I'm seeing an error like this:

"""
Traceback (most recent call last):
  File "/home/michael/.pyenv/versions/3.6.7/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/michael/.pyenv/versions/3.6.7/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/michael/.local/share/virtualenvs/spectifry-f8J1UpGn/lib/python3.6/site-packages/spectrify/convert.py", line 194, in _parallel_wrapper
    CsvConverter(sa_table, s3_config).convert_csv(data_path)
  File "/home/michael/.local/share/virtualenvs/spectifry-f8J1UpGn/lib/python3.6/site-packages/spectrify/convert.py", line 108, in convert_csv
    for chunk in self.columnar_data_chunks(file_path, self.sa_table, SPECTRIFY_ROWS_PER_GROUP):
  File "/home/michael/.local/share/virtualenvs/spectifry-f8J1UpGn/lib/python3.6/site-packages/spectrify/convert.py", line 157, in columnar_data_chunks
    value = self._convert_to_type(value, py_type)
  File "/home/michael/.local/share/virtualenvs/spectifry-f8J1UpGn/lib/python3.6/site-packages/spectrify/convert.py", line 131, in _convert_to_type
    value = py_type(value)
  File "/home/michael/.local/share/virtualenvs/spectifry-f8J1UpGn/lib/python3.6/site-packages/spectrify/convert.py", line 48, in postgres_bool_to_python_bool
    raise ValueError("Unknown boolean value {}".format(val))
ValueError: Unknown boolean value Asia/Calcutta
"""

This is happening on an export of a couple of million rows.

Now I'm sure this might be an error cropping up from some weird characters in the CSV export being interpreted as the delimiter character (I see this kind of stuff a lot), but my question is more about the best way to troubleshoot/debug which file and line number the error originates from (which I unfortunately can't tell from the stack trace alone).

I've started looking around the code, and my hunch is to subclass the CsvConverter class and add some exception handling where I could annotate the exception with more information, but I wanted to ask for some guidance/advice on whether there might be an easier way?
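
For what it's worth, this is the rough shape of what I had in mind. It's only a sketch based on the call sites visible in the stack trace (`CsvConverter(sa_table, s3_config).convert_csv(data_path)` and `self._convert_to_type(value, py_type)`), so the real signatures in `spectrify.convert` may differ:

```python
# Sketch only: assumes the CsvConverter.convert_csv(data_path) and
# _convert_to_type(value, py_type) call sites seen in the traceback;
# the actual signatures in spectrify.convert may differ.
from spectrify.convert import CsvConverter


class AnnotatingCsvConverter(CsvConverter):
    def convert_csv(self, data_path):
        # Remember which CSV chunk we're converting so failures can name it.
        self._current_path = data_path
        return super().convert_csv(data_path)

    def _convert_to_type(self, value, py_type):
        try:
            return super()._convert_to_type(value, py_type)
        except ValueError as exc:
            # Re-raise with the offending file and raw value attached.
            raise ValueError('{} (file={!r}, value={!r})'.format(
                exc, getattr(self, '_current_path', '?'), value))
```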

c-nichols commented 5 years ago

Yes, that makes sense to me. You might also try some of the techniques described in #6 related to escaping. I guess I've been lucky; I haven't encountered issues like this... let us know what you find!

c-nichols commented 5 years ago

Also, happy you're finding a use case outside of redshift --> spectrum. I'm curious, what are the steps you're taking to get your data into BigQuery? Can BigQuery accept parquet files, or are you working with CSV only?

michael-erasmus commented 5 years ago

Hey @c-nichols!

Thanks for the advice! I'll close this issue for now. I'll share any updates on what I learn here!

> Also, happy you're finding a use case outside of redshift --> spectrum. I'm curious, what are the steps you're taking to get your data into BigQuery? Can BigQuery accept parquet files, or are you working with CSV only?

We're loading data into BigQuery using Parquet, so I'm using Spectrify to unload the data to CSV and convert it to Parquet. From there I use Google's Storage Transfer Service to move the Parquet files from S3 into Google Cloud Storage, and then finally I load the Parquet files into BigQuery. Using this flow has two benefits over just using CSV:

  1. Spectrify can use the Redshift table schema to determine the schema of the data, which is written as metadata on the .parq files. This is much less error-prone than having BigQuery infer a schema from CSV files.
  2. Parquet files are much smaller than CSV files, so it costs less to transfer them from S3 to GCS.
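
The final load step looks roughly like the snippet below (the bucket path, dataset, and table names are just placeholders):

```python
# Rough sketch of the final load into BigQuery from Parquet files already
# sitting in GCS; the bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.PARQUET

load_job = client.load_table_from_uri(
    'gs://my-transfer-bucket/redshift-export/*.parq',  # Parquet chunks from Spectrify
    'my_project.my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # Wait for the load to finish.
```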

Hope that makes sense!

andrewgross commented 5 years ago

Be careful of any unicode strings that have the null character (\u0000) in them. Redshift has issues exporting this value, which can lead to offset columns and corrupted data. While they claim they don't support it, it is possible to introduce null characters when loading JSON data (though not CSV, in my experience). Other than that, making sure you are quoting every field, escaping characters, and using custom NULL values has gotten me a lot of mileage.
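
If you want to rule that out, a quick scan of the exported CSV chunks will tell you whether any null characters slipped through. This assumes the chunks have been downloaded and decompressed locally, and the glob pattern is just an example:

```python
# Quick check for stray null characters in exported CSV chunks; the glob
# pattern is an example and assumes the files are already decompressed.
import glob

for path in glob.glob('exports/*.csv'):
    with open(path, encoding='utf-8', errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            if '\u0000' in line:
                print('null character in {} at line {}'.format(path, lineno))
```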