prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.07k stars 5.38k forks source link

array_position doesn't support nulls while velox does #23412

Open DanielHunte opened 3 months ago

DanielHunte commented 3 months ago

The Presto function array_position throws and exception when there are any nulls present in the array. Velox handles nulls and produces a valid output.

This is related to #10580

Your Environment

Expected Behavior

Presto should handle nulls and produces a valid output.

Current Behavior

An exception is thrown when nulls are present.

Possible Solution

Steps to Reproduce

Query: select array_position(c0, c1) from (values (array[array[1, null]], array[1, null])) t(c0, c1);

Screenshots (if appropriate)

Context

Presto and Velox Correctness

spershin commented 2 months ago

It is not very clear, honestly, what and how we should fix here. Presto throws only when there are nested types with nulls: select array_position(c0, c1) from (values (array[array[1, null]], array[1, null])) t(c0, c1); while Velox returns 1.

Queries like

SELECT array_position(c0, 5) from (values (ARRAY[ 4, 5, 6])) t(c0);
SELECT array_position(c0, 5) from (values (ARRAY[ 4, 5, NULL])) t(c0);
SELECT array_position(c0, 5) from (values (ARRAY[ NULL, 5, 6])) t(c0);
SELECT array_position(c0, NULL) from (values (ARRAY[ NULL, 5, 6])) t(c0);

work fine.

In the 1st case, is array[1, null] equal to array[1, null]? According to SQL standard NULL is "not equal" to another NULL.

So, technically, should Velox return NULL for the 1st query?

Take

SELECT array_position(c0, 5) from (values (ARRAY[NULL, 5, 6])) t(c0);

Should we return NULL here too, because the 1st element NULL might be 5 or might not be 5? Undefined.

Need some input here from multiple folks.

Throwing is kind of not nice, but then again, could be all right.

Yuhta commented 2 months ago

Since we already defined NaN == NaN despite IEEE standard said no, I think NULL == NULL is still an option on the table despite SQL standard said no.

pedroerp commented 1 month ago

From the function above:

SELECT array_position(c0, 5) from (values (ARRAY[NULL, 5, 6])) t(c0); // => 2

it seems like the semantic is that you should consider matches even if some comparisons return null (5 == NULL).

Then for the initial case, no comparisons will match (all will return NULL), since "array[1, null] == array[1, null]" is NULL;

SELECT array_position(c0, c1) from (values (array[array[1, null]], array[1, null])) t(c0, c1);

According to the docs, array_position() should return 0 if there are no matches. For example:

SELECT array_position(c0, 5) from (values (ARRAY[NULL, NULL, NULL])) t(c0); // => 0

From this logic, it seems like the function above should return 0 as well.

Curious to what @rschlussel @tdcmeehan and others who know this codebase better think though.

rschlussel commented 3 weeks ago

I think we should probably return null for nested nulls, but we could also potentially change the behavior to use distinct semantics (null == null). Whatever choice should be the same between nulls and nested nulls. Current behavior of throwing an exception for any nested nulls in the input isn't great.