prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

SliceDictionarySelectiveReader filter evaluation failure #19138

Closed yingsu00 closed 1 year ago

yingsu00 commented 1 year ago

Repro

 CREATE TABLE glue.tpcds_sf100_orc.household_demographics (
    "hd_demo_sk" bigint,
    "hd_income_band_sk" bigint,
    "hd_buy_potential" char(15),
    "hd_dep_count" integer,
    "hd_vehicle_count" integer
 )
 WITH (
    format = 'ORC'
 );

INSERT INTO glue.tpcds_sf100_orc.household_demographics SELECT * from tpcds.sf100.household_demographics;

set session glue.pushdown_filter_enabled=true;

-- Should return 1200, but returned 0
select count(*) from household_demographics where hd_buy_potential='>10000';
 _col0
-------
     0
(1 row)

set session glue.pushdown_filter_enabled=false;
 select count(*) from household_demographics where hd_buy_potential='>10000';
 _col0
-------
  1200
(1 row)

The root cause was that the selective readers push down and evaluate the filters in SliceDictionarySelectiveReader, but it thinks all values are 15 bytes long. On the other hand the dictionary values were truncated, and doesn't include the paddings. The rowGroupFilter for '>10000' was with 6 chars only.

    private int readWithFilter(int[] positions, int positionCount)
            throws IOException
    {
                int length;
                if (isCharType) {
                    length = maxCodePointCount;   // maxCodePointCount = 15
                }
                else {
                    length = inRowDictionary ? rowGroupDictionaryLength[rawIndex] : stripeDictionaryLength[rawIndex];
                }
                if (evaluationStatus == null) {
                    evaluateFilter(position, index, length);
                }
    }

    private byte evaluateFilter(int position, int index, int length)
    {
        if (!rowGroupFilter.testLength(length)) {.  // always fail because length  = 15 but filter size was 6.
            return FILTER_FAILED;
        }
    }
yingsu00 commented 1 year ago

The filter with paddings should output the same but they all fail:

presto:test> set session hive.pushdown_filter_enabled=false;
SET SESSION
presto:test> select count(*) from household_demographics where hd_buy_potential='>10000         ';
 _col0
-------
  1200
(1 row)
presto:test> set session hive.pushdown_filter_enabled=true;
SET SESSION
presto:test> select count(*) from household_demographics where hd_buy_potential='>10000   ';
 _col0
-------
     0
(1 row)