hydradatabase / hydra

Hydra: Column-oriented Postgres. Add scalable analytics to your project in minutes.
https://www.hydra.so
Apache License 2.0

Confused about columnar cache regression test #261

Open japinli opened 6 months ago

japinli commented 6 months ago

What's wrong?

Hi,

When I read the regression test in columnar_cache.sql, I noticed that it contains the following test case:

CREATE TABLE big_table (
  id INT,
  firstname TEXT,
  lastname TEXT
) USING columnar;

INSERT INTO big_table (id, firstname, lastname)
  SELECT i,
         CONCAT('firstname-', i),
         CONCAT('lastname-', i)
    FROM generate_series(1, 1000000) as i;

-- get some baselines from multiple chunks
SELECT firstname,
       lastname,
       SUM(id)
  FROM big_table
 WHERE id < 1000
 GROUP BY firstname,
       lastname
UNION
SELECT firstname,
       lastname,
       SUM(id)
  FROM big_table
 WHERE id BETWEEN 15000 AND 16000
 GROUP BY firstname,
       lastname
 ORDER BY firstname;

-- enable caching
SET columnar.enable_column_cache = 't';

-- the results should be the same as above
SELECT firstname,
       lastname,
       SUM(id)
  FROM big_table
 WHERE id < 1000
 GROUP BY firstname,
       lastname
UNION
SELECT firstname,
       lastname,
       SUM(id)
  FROM big_table
 WHERE id BETWEEN 15000 AND 16000
 GROUP BY firstname,
       lastname
 ORDER BY firstname;

The comment claims that both queries should return the same results, but the expected output in columnar_cache.out differs: the first query returns 2000 rows, while the second returns only 999.

Is this expected?
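
For reference, the expected row counts can be derived from the test data alone. The sketch below is not part of columnar_cache.sql; it assumes only plain PostgreSQL (9.4 or later, for FILTER) and regenerates the same id range with generate_series. Because every id produces a distinct firstname/lastname pair, each group is a single row and UNION removes nothing.

```sql
-- Sanity check of the expected counts, independent of the columnar table.
-- The column aliases are illustrative names, not taken from the test suite.
SELECT count(*) FILTER (WHERE i < 1000)                   AS first_branch,   -- 999
       count(*) FILTER (WHERE i BETWEEN 15000 AND 16000)  AS second_branch,  -- 1001 (BETWEEN is inclusive)
       count(*) FILTER (WHERE i < 1000
                           OR i BETWEEN 15000 AND 16000)  AS union_total     -- 2000
  FROM generate_series(1, 1000000) AS i;
```

So 2000 rows is the correct result, and 999 matches the first branch alone, which suggests (but does not prove) that the second branch contributes no rows once caching is enabled.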

wuputah commented 6 months ago

Good catch. It looks like the UNION part is not working when caching is enabled. @JerrySievert, would you be able to comment here?
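
One possible way to narrow this down is to run each branch of the UNION on its own, with the cache off and then on. This is only a diagnostic sketch: it reuses big_table and the columnar.enable_column_cache setting from the test above, and it deliberately makes no claim about what the cached runs return.

```sql
-- Hypothetical diagnostic, not part of the regression suite.
-- Baseline with the cache disabled: the data implies 999 and 1001 rows.
SET columnar.enable_column_cache = 'f';
SELECT count(*) FROM big_table WHERE id < 1000;
SELECT count(*) FROM big_table WHERE id BETWEEN 15000 AND 16000;

-- Same two queries with the cache enabled; any difference from the
-- baseline points at the branch (and chunk range) that misbehaves.
SET columnar.enable_column_cache = 't';
SELECT count(*) FROM big_table WHERE id < 1000;
SELECT count(*) FROM big_table WHERE id BETWEEN 15000 AND 16000;
```

If each branch is correct in isolation with the cache on, the problem is more likely in how the two scans interact inside a single UNION query than in the filtering itself.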

JerrySievert commented 6 months ago

I can take a look at it in the next day or so. It's been long enough since I worked on the caching code that I'd need to do a deeper dive.