prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

width_bucket() does not seem to process NULL in the array elements properly. #24055

Open spershin opened 2 hours ago

spershin commented 2 hours ago

These return 0: SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL]) t(c0); SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1]) t(c0); SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0); SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);

These return 1: SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL]) t(c0);

These return 2: SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1]) t(c0); SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);

These return 3: SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);

These fail with "Bin values are not sorted in ascending order": SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL]) t(c0); SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0); SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL]) t(c0); SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);

And so on. There is no check for null elements in the code and unclear how the function should behave if it encounters one. From strict perspective it seems like the function should fail whenever it finds a null element because "Bin values are not sorted in ascending order". What is interesting is depending on the 1st argument and the array, it is not guaranteed that we stumble on a particular NULL due to the binary search nature of the algorithm.

Trying to understand if we should change the function behavior.

Expected Behavior

Unclear.

Current Behavior

See description.

Possible Solution

  1. If NULL encountered - fail the query.
  2. Scan for any null elements beforehand and fail if found - cold be too expensive.
  3. Leave as is, update documentation.

Steps to Reproduce

In any environment run the example queries.

mbasmanova commented 2 hours ago

CC: @aditi-pandit @rschlussel

tdcmeehan commented 2 hours ago

This variation of width_bucket is defined in the SQL spec and clearly defines the null behavior. However, I cannot find the overload mentioned in this issue in the SQL spec, nor do I see a comparable version in Postgres.

Yuhta commented 1 hour ago

I would vote for solution 2 as a null in bins cannot be interpret in any meaningful ways. Checking if they are nulls or not should not be too expensive as the nulls bits are contiguous in memory (so you can check 64 or 256 of them at same time, making it essentially free).