laughingman7743 / PyAthena

PyAthena is a Python DB API 2.0 (PEP 249) client for Amazon Athena.
MIT License
463 stars 105 forks source link

PandasCursor converts NULL values in Sttring columns to empty String #168

Open jurgispods opened 4 years ago

jurgispods commented 4 years ago

In #117, a fix was made for NULL results not being returned by PandasCursor. This now work as expected:

> ret = conn.cursor(PandasCursor).execute("select * from (values (1), (NULL))").fetchall()
> ret
[(1,), (<NA>,)]
> [pd.isna(x[0]) for x in ret]
[False, True]

However, NULL values for String columns are secretly converted to empty Strings:

> ret = conn.cursor(PandasCursor).execute("select * from (values ('bla'), (NULL))").fetchall()
> ret
[('bla',), ('',)]
> [pd.isna(x[0]) for x in ret]
[False, False]

Is this the expected behaviour? I believe NULL should always be converted to NaN, regardless of na_values or keep_default_na.

laughingman7743 commented 4 years ago

ret = conn.cursor(PandasCursor).execute("select * from (values (1), (NULL))").fetchall() ret [('bla',), ('',)] [pd.isna(x[0]) for x in ret] [False, False]

It seems that the query in this example is wrong. In this case, what kind of query are you executing? It may be the same issue as #118.

jurgispods commented 4 years ago

@laughingman7743 Apologies, there was a copy-paste error in my second query. I've edited the original post.

jurgispods commented 4 years ago

And regarding #118: I understand that I can use na_values='' to force converting the respective date in the Athena-generated CSV to NaN. But this means losing the possibility to distinguish between NULL and the empty String in the original Athena table.

I guess this cannot be circumvented, as the information is already lost in the Athena-generated CSV?

laughingman7743 commented 4 years ago

If you check the CSV file output from Athena, you'll see that empty characters are double-quoted, while null characters are not.

"_col0"
"blah"

""

If you can handle them well, you should be able to identify null and empty characters. But I don't know how to configure the read_csv method's options to better read this CSV.

jurgispods commented 4 years ago

I see. One approach would be to disable quoting and remove the quotes afterwards. With your example in mind:

import pandas as pd
import csv

df = pd.read_csv('myfile.csv', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
# remove quotes from String columns manually
df.select_dtypes([object]).apply(lambda col: col.str[1:-1])

This is not very elegant, but if CSVs written by Athena are guaranteed to contain quotes Strings, this should always work. What do you think?