hellonarrativ / spectrify

Export Redshift data and convert to Parquet for use with Redshift Spectrum or other data warehouses.
https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/
MIT License

NULL value handling #6

Open woodlee opened 7 years ago

woodlee commented 7 years ago

This is awesome, thank you for putting this out there! I just recently finished writing something similar for use at my company, but hit enough snags with the Parquet conversion bit that we punted on that portion. Perhaps we'll be switching to this soon.

One issue we had to deal with was proper round-trip representation of null values, as distinct from empty strings. Based on a quick glance it doesn't look like you're doing anything here to deal with that, but please let me know if I missed something!

Redshift UNLOADs provide the NULL AS option to allow specifying how nulls should be output, but there's a Redshift bug (which was actually acknowledged by an Amazon employee in an AWS forum posting somewhere, though I don't have the link handy) such that when you use both the NULL AS and ESCAPE options, the ESCAPE gets applied second. So, you can't get Redshift to unload a value that is guaranteed to uniquely represent nulls. A lot of systems out there seem to use the convention of representing nulls as \N in text files, but because of the bug you can only get Redshift unloads that use ESCAPE to write N (obviously problematic) or \\N (which would probably be okay most of the time but still isn't a solid guarantee).
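To make that concrete, here's a hypothetical sketch of the failure mode (constructed from the behavior described above, not captured from a real unload; credentials clause omitted for brevity):

    -- Suppose t holds three rows: 'a', NULL, and the literal two-character
    -- string backslash-N:
    UNLOAD ('SELECT col FROM t')
    TO 's3://bucket/prefix'
    DELIMITER '|' NULL AS '\\N' ESCAPE;

    -- Hoped-for file:   a      \N      \\N
    -- Actual file:      a      \\N     \\N
    --                          ^ ESCAPE also rewrites the null marker, so a
    --                            NULL and a literal \N collide in the output.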

We got around this by using an UNLOAD query formatted like so:

    UNLOAD ('{inner_unload_query}')
    TO '{unload_path}'
    IAM_ROLE '{iam_role_arn}'
    MAXFILESIZE 1 gb
    DELIMITER '|' NULL AS '\\N' MANIFEST GZIP

...where the inner_unload_query portion mostly just SELECTs all the table's columns by name, but uses a clause like this for columns of type character and character varying:

REPLACE(REPLACE(REPLACE(REPLACE("{column_name}", '\\', '\\\\'), '|', '\\|'), '\n', '\\\n'), '\r', '\\\r')

That's admittedly pretty ugly, but the upshot is that it lets you avoid the ESCAPE option entirely, so the NULL AS '\\N' part actually works and produces an output file where nulls are represented by \N. Once you have that in place, everything downstream can be configured to recognize that marker, and the distinction between nulls and empty strings is maintained.

Just throwing that at you in case you find it valuable! Putting it all together, the whole relevant chunk from my code (any or all of which you should feel free to use) looks like:

    # source_schema, schema, table, unload_path, iam_role_arn, conn, and
    # logger all come from the surrounding script.
    cols = []
    # Manually escape backslash, delimiter, LF, and CR in text columns so we
    # can avoid the UNLOAD ESCAPE option (and its NULL AS interaction bug).
    escape_character_col_spec = r"""
        REPLACE(REPLACE(REPLACE(REPLACE("{}", '\\', '\\\\'), '|', '\\|'), '\n', '\\\n'), '\r', '\\\r')
    """
    for col in source_schema:
        if col['base_type'].startswith('character'):
            cols.append(escape_character_col_spec.format(col['name']))
        else:
            cols.append(r'"{}"'.format(col['name']))
    cols = ','.join(cols)
    inner_unload_query = r'SELECT {cols} FROM "{schema}"."{table}"'.format(**locals())
    # Double backslashes and escape single quotes so the inner query can sit
    # inside the UNLOAD statement's quoted string literal.
    inner_unload_query = inner_unload_query.replace('\\', '\\\\').replace("'", r"\'")
    unload_query = r"""
        UNLOAD ('{inner_unload_query}')
        TO '{unload_path}'
        IAM_ROLE '{iam_role_arn}'
        MAXFILESIZE 1 gb
        DELIMITER '|' NULL AS '\\N' MANIFEST GZIP
    """.format(**locals())
    logger.debug('Unloading with query: %s', unload_query)
    conn.execute(unload_query)
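And in case it helps, here's a rough sketch (not from our production code) of how a downstream reader could honor the \N convention, assuming the exact escaping scheme above and, for simplicity, records with no escaped newlines:

    import re

    NULL_TOKEN = r'\N'

    def parse_unloaded_line(line):
        # Walk the line once, keeping backslash escape pairs intact so that
        # only unescaped '|' characters split fields.
        fields, raw, i = [], [], 0
        while i < len(line):
            if line[i] == '\\' and i + 1 < len(line):
                raw.append(line[i:i + 2])
                i += 2
            elif line[i] == '|':
                fields.append(''.join(raw))
                raw = []
                i += 1
            else:
                raw.append(line[i])
                i += 1
        fields.append(''.join(raw))
        # Check for \N *before* unescaping: the REPLACE chain can never emit
        # a bare \N from real data, so the marker unambiguously means NULL.
        return [None if f == NULL_TOKEN
                else re.sub(r'\\(.)', r'\1', f, flags=re.S)
                for f in fields]

    # Empty string stays '', \N becomes None, \| decodes to a literal pipe:
    assert parse_unloaded_line(r'a\|b|\N||x') == ['a|b', None, '', 'x']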
c-nichols commented 7 years ago

This is fantastic information! Thank you for sharing all this. One of my goals for this project is 100% data integrity so I'd love to have this as an option if not the default.

naveenarumugam commented 6 years ago

@c-nichols I'm glad that Spectrify is here; it's been helpful! Just wondering when this fix will be available (sorry if I've overlooked it). Currently, the UNLOAD does a plain dump.
@woodlee Should I change the export_to_csv method in export.py with this snippet? Currently, I see UNLOAD_QUERY is defined outside the method. It would be helpful to reuse your code, as I'm sure I'm going to hit the NULL/timestamp issues. Appreciated!

    formatted_query = UNLOAD_QUERY.format(table_name)
    with engine.connect() as cursor:
        click.echo('Exporting table to CSV...')
        cursor.execute(formatted_query, {
            's3_path': s3_csv_path,
            'credentials': creds_str,
        })
        click.echo('Done.')

woodlee commented 6 years ago

@naveenarumugam I think something like that would be reasonable; it's still working for us. Just keep in mind that my snippet is from an entirely different script I wrote, not a modification to this library, so it'll certainly require some adjustment.

One thing that we have changed since I posted is to add a LEFT clause just inside the nasty series of REPLACEs like so:

REPLACE(REPLACE(REPLACE(REPLACE(LEFT("{}", 50000)...

...to help get around situations where the manual escaping causes the generated column's size to exceed Redshift's 64K max for varchars.
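Spelled out (assuming the truncation above continues with the same REPLACE chain as my earlier snippet), the full column spec would be:

    REPLACE(REPLACE(REPLACE(REPLACE(LEFT("{}", 50000), '\\', '\\\\'), '|', '\\|'), '\n', '\\\n'), '\r', '\\\r')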

Good luck!

naveenarumugam commented 6 years ago

@woodlee Thank you, I will test it. Meanwhile, should I treat the snippet as the whole thing, or, if possible, could you please share the full script?

woodlee commented 6 years ago

@naveenarumugam Here you go!

c-nichols commented 6 years ago

I'll try to integrate this over the next few weeks, or at least make it easier to run the export with this query.

I'm also considering a rewrite of a lot of the core functionality to use classes instead of nested functions, the idea being that classes are easy to extend and modify via inheritance.

naveenarumugam commented 6 years ago

Thanks @woodlee
@c-nichols I just modified the code to add Snappy compression (it currently defaults to Gzip). Do I have write permission to create a branch and check it in to this project? (I'm trying to create a PR.) Please let me know.
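For reference, if the Parquet writing goes through pyarrow, the relevant knob is write_table's compression argument; this is a generic sketch, not necessarily how Spectrify wires it up:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pydict({'col': [1, 2, 3]})
    # compression accepts 'snappy', 'gzip', and others
    pq.write_table(table, 'out.parquet', compression='snappy')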

c-nichols commented 6 years ago

Hi @naveenarumugam, please use a fork to open a PR. The following links provide some more info:

https://guides.github.com/activities/forking/
https://spectrify.readthedocs.io/en/stable/contributing.html#get-started

naveenarumugam commented 6 years ago

Sure, Thanks @c-nichols

picarus commented 5 years ago

Any progress on this feature request?

A related issue we are having is that character strings such as the empty string ('') or a single space (' ') are also converted to NULL when we unload. Is this something that could be addressed?

@woodlee, it would be useful if you could provide the link to the AWS bug you mentioned in your first message.

I should say that the problem we are trying to solve is different, as we want to migrate data from Redshift to BigQuery. Any experience with this process? The first two steps of Spectrify have been very useful so far.

c-nichols commented 5 years ago

To my knowledge no one is working on incorporating this currently. BigQuery's auto importer has some annoying quirks/limitations... For the most part our data does not distinguish between null and empty, so I'm afraid I don't have much code to share.

Nihhaar commented 5 years ago

@woodlee Is the issue of ESCAPE being applied after NULL AS still present in AWS?

woodlee commented 5 years ago

@Nihhaar sorry, I don't know. Haven't worked with it in a while.