Closed pvieito closed 2 years ago
Sorry, this took a while getting back to. I've been testing this and had a few observations.
ignoreEmpty
as false, then you see the value 0 for number.But there is no way to know whether the original uploaded data value was empty or NULL. And since Athena itself supports both for all fields, including number (which is an anti pattern), then I think Athena-express shouldn't (and cannot accurately) correct that and therefore present that data as-is.
Let me know what you think
If we leave the code "as is" it crashes with queries like this:
SELECT TRUE AS "BoolColumn"
UNION ALL
SELECT NULL
UNION ALL
SELECT FALSE
If the column is of Bool or Numeric it should parse an empty string as a NULL value (it cannot be an empty string).
How will we distinguish if the original uploaded value was meant to be empty or null, given Athena doesn't distinguish between the two? Feel like we might be making a leap here with polluting the response since Empty and Null are not the same values.
If the column has been marked by Athena as boolean
it cannot contains empty values, only true
, false
or null
. The same for numeric
columns. Only varchar
columns have this ambiguity, but typically data libraries like Pandas infer null
for empty values when reading CSV files.
That's what I was showing with the screenshot I posted above. Athena allows empty values for Boolean and Numeric columns. Upload a sample dataset with empty and null values for Athena and you will notice both are allowed and both show up as empty.
I does not. It is interpreting them as NULL values and showing them as NULL. Try this in your example dataset:
SELECT *
FROM table
WHERE boolean is NULL
And obviously this is allowed:
SELECT TRUE
UNION ALL
SELECT NULL
UNION ALL
SELECT FALSE
But this is not:
SELECT TRUE
UNION ALL
SELECT ''
UNION ALL
SELECT FALSE
-- SYNTAX_ERROR: line 4:5: column 2 in UNION query has incompatible types: boolean, varchar(4)
You're right in that Athena interprets uploaded empty values for Boolean/Number as NULL.
With your PR, all fields (not just Boolean and number) that are empty will become null. And that will ignore the genuine empty values
Yes, but note that this is the expected behavior in CSV parsers. Even Athena does this, if you upload a dataset with some empty values in a text
column I will infer NULL values:
SELECT *
FROM table
WHERE text is NULL
So you're suggesting replacing all empty values with NULL. Is there no use case where empty values could stay as empty values?
Yes, that is the typical behavior of CSV/Athena parsers. At least boolean
and numeric
values are currently being incorrectly handled (they cannot contain empty strings). While varchar
values can contain empty strings there is no way to distinguish between it and a NULL value so one should "win" (I personally prefer NULL but we could change the PR to return ''
by default for varchar
values).
ok I'll accept the PR
seems to work as expected. Are you noticing any issues?
On Sat, Jan 22, 2022 at 1:12 AM João Dias @.' via 33Mail @.> wrote:
This email was sent to the alias @.' by ' @.', and 33Mail forwarded it to you. To block all further emails to this alias click here http://www.33mail.com/alias/unsub/41bdfade131b958a27a3e30e285264d5
@.**** commented on this pull request.
In lib/helpers.js https://github.com/ghdna/athena-express/pull/72#discussion_r790109024:
- case "boolean":
- updatedObjectWithDataType[key] = JSON.parse(
- input[key].toLowerCase()
- );
- break;
- case "integer":
- case "tinyint":
- case "smallint":
- case "int":
- case "float":
- case "double":
- updatedObjectWithDataType[key] = Number(input[key]);
- break;
- default:
- updatedObjectWithDataType[key] = input[key];
- if (!input[key]) {
What if input[key] is false?
— Reply to this email directly, view it on GitHub https://github.com/ghdna/athena-express/pull/72#pullrequestreview-860173549, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGRKZXMWZ2FMLYKM5IK6WLUXJDD7ANCNFSM5IT6QWCA . You are receiving this because you modified the open/close state.Message ID: @.***>
@ghdna no, sorry. I just misread something, nvm
Without this fix when reading a boolean column with a NULL value ("" in the input CSV) it would fail to parse as JSON and in numeric columns the NULL value would be coerced to a 0. Now when an empty string is detected it is correctly handled as a NULL value.