ghdna / athena-express

Athena-Express can simplify executing SQL queries in Amazon Athena AND fetching cleaned-up JSON results in the same synchronous or asynchronous request - well suited for web applications.
https://www.npmjs.com/package/athena-express
MIT License
179 stars 70 forks source link

Fixed type validation with support for NULL #72

Closed pvieito closed 2 years ago

pvieito commented 2 years ago

Without this fix when reading a boolean column with a NULL value ("" in the input CSV) it would fail to parse as JSON and in numeric columns the NULL value would be coerced to a 0. Now when an empty string is detected it is correctly handled as a NULL value.

ghdna commented 2 years ago

Sorry, this took a while getting back to. I've been testing this and had a few observations.

  1. On Athena itself, when you run a query on data that contains empty and null numbers and boolean, the result shown is always empty on the console. See below. The same is reflected in the CSV file that Athena produces and stores in S3. Those fields simply don't have any value.

Screen Shot 2022-01-17 at 10 28 33

  1. When queuing this data using our library, athena-express, such fields are hidden from those rows, by default. If you set ignoreEmpty as false, then you see the value 0 for number.

But there is no way to know whether the original uploaded data value was empty or NULL. And since Athena itself supports both for all fields, including number (which is an anti pattern), then I think Athena-express shouldn't (and cannot accurately) correct that and therefore present that data as-is.

Let me know what you think

pvieito commented 2 years ago

If we leave the code "as is" it crashes with queries like this:

SELECT TRUE AS "BoolColumn"
UNION ALL
SELECT NULL
UNION ALL
SELECT FALSE

If the column is of Bool or Numeric it should parse an empty string as a NULL value (it cannot be an empty string).

ghdna commented 2 years ago

How will we distinguish if the original uploaded value was meant to be empty or null, given Athena doesn't distinguish between the two? Feel like we might be making a leap here with polluting the response since Empty and Null are not the same values.

pvieito commented 2 years ago

If the column has been marked by Athena as boolean it cannot contains empty values, only true, false or null. The same for numeric columns. Only varchar columns have this ambiguity, but typically data libraries like Pandas infer null for empty values when reading CSV files.

ghdna commented 2 years ago

That's what I was showing with the screenshot I posted above. Athena allows empty values for Boolean and Numeric columns. Upload a sample dataset with empty and null values for Athena and you will notice both are allowed and both show up as empty.

pvieito commented 2 years ago

I does not. It is interpreting them as NULL values and showing them as NULL. Try this in your example dataset:

SELECT *
FROM table
WHERE boolean is NULL

And obviously this is allowed:

SELECT TRUE
UNION ALL 
SELECT NULL
UNION ALL 
SELECT FALSE

But this is not:

SELECT TRUE
UNION ALL 
SELECT ''
UNION ALL 
SELECT FALSE

-- SYNTAX_ERROR: line 4:5: column 2 in UNION query has incompatible types: boolean, varchar(4)
ghdna commented 2 years ago

You're right in that Athena interprets uploaded empty values for Boolean/Number as NULL.

ghdna commented 2 years ago

With your PR, all fields (not just Boolean and number) that are empty will become null. And that will ignore the genuine empty values

pvieito commented 2 years ago

Yes, but note that this is the expected behavior in CSV parsers. Even Athena does this, if you upload a dataset with some empty values in a text column I will infer NULL values:

SELECT *
FROM table
WHERE text is NULL
ghdna commented 2 years ago

So you're suggesting replacing all empty values with NULL. Is there no use case where empty values could stay as empty values?

pvieito commented 2 years ago

Yes, that is the typical behavior of CSV/Athena parsers. At least boolean and numeric values are currently being incorrectly handled (they cannot contain empty strings). While varchar values can contain empty strings there is no way to distinguish between it and a NULL value so one should "win" (I personally prefer NULL but we could change the PR to return '' by default for varchar values).

ghdna commented 2 years ago

ok I'll accept the PR

ghdna commented 2 years ago

seems to work as expected. Are you noticing any issues?

On Sat, Jan 22, 2022 at 1:12 AM João Dias @.' via 33Mail @.> wrote:

This email was sent to the alias @.' by ' @.', and 33Mail forwarded it to you. To block all further emails to this alias click here http://www.33mail.com/alias/unsub/41bdfade131b958a27a3e30e285264d5

@.**** commented on this pull request.

In lib/helpers.js https://github.com/ghdna/athena-express/pull/72#discussion_r790109024:

  • case "boolean":
  • updatedObjectWithDataType[key] = JSON.parse(
  • input[key].toLowerCase()
  • );
  • break;
  • case "integer":
  • case "tinyint":
  • case "smallint":
  • case "int":
  • case "float":
  • case "double":
  • updatedObjectWithDataType[key] = Number(input[key]);
  • break;
  • default:
  • updatedObjectWithDataType[key] = input[key];
  • if (!input[key]) {

What if input[key] is false?

— Reply to this email directly, view it on GitHub https://github.com/ghdna/athena-express/pull/72#pullrequestreview-860173549, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGRKZXMWZ2FMLYKM5IK6WLUXJDD7ANCNFSM5IT6QWCA . You are receiving this because you modified the open/close state.Message ID: @.***>

alfaproject commented 2 years ago

@ghdna no, sorry. I just misread something, nvm