h2database / h2database

H2 is an embeddable RDBMS written in Java.
https://h2database.com
Other
4.2k stars 1.19k forks source link

Slow query execution when using IN (value1, value2...) even though UNION with individual = conditions works fine #3747

Closed ameyer3000 closed 9 months ago

ameyer3000 commented 1 year ago

I am struggling with slow query execution for a WHERE-clause such as field IN ('value1', 'value2' ), whereas equivalent queries (using UNION or CTE, see below) are performed as expected. Self-contained SQL below, tested with the standalone console (java -jar h2-2.1.214.jar).

DROP TABLE test IF EXISTS;
CREATE TABLE test(
  name CHARACTER VARYING,
  type CHARACTER VARYING,
 PRIMARY KEY (type, name)
);

INSERT INTO test 
SELECT 
  'NAME' || name_range.x, 
  'TYPE' || type_range.x
        FROM system_range(0,99999) name_range
  CROSS JOIN system_range(0,9) type_range;

-- 1 million rows
SELECT count(*) FROM test;

-- this is OK: "scanCount: 2"
EXPLAIN ANALYZE 
  SELECT * FROM test WHERE name = 'NAME9999' AND type = 'TYPE1';

-- following two are NOT OK: "scanCount: 200001, reads: 8137"
-- looks like the PK can only be used for field "name"
EXPLAIN ANALYZE
   SELECT * FROM test WHERE name = 'NAME9999' AND type IN ( 'TYPE1', 'TYPE2' );

EXPLAIN ANALYZE
   SELECT * FROM test WHERE name = 'NAME9999' AND ( type = 'TYPE1' OR type = 'TYPE2' );

-- next two are OK again: "scanCount: 2"
EXPLAIN ANALYZE
   SELECT * FROM test WHERE name = 'NAME9999' AND type = 'TYPE1' 
  UNION
   SELECT * FROM test WHERE name = 'NAME9999' AND type = 'TYPE2';

EXPLAIN ANALYZE
  WITH types AS (SELECT * FROM ( VALUES ('TYPE1'), ('TYPE2') ) AS t(type) )
  SELECT * FROM test 
    JOIN types t ON test.type = t.type
   WHERE name = 'NAME9999' 

As the IN ( ... )-condition isn't really complex, I would assume that the primary key can be used. Am I missing something, or is this optimization not yet implemented (as suggested by user "Impaler" on stackoverflow ).

NOTE: when changing the primary key definition to PRIMARY KEY (name, type), the problem kind of disappears for this particular example - however this still wouldn't entirely fix the general problem as I don't have full control over the schema definition and even if I had, similar queries where other fields from the PK are appearing in field IN ( ... ) would still be slow.

katzyn commented 1 year ago

You need an additional index on TEST(NAME)

CREATE INDEX TEST_NAME_IDX ON TEST(NAME);

and, unfortunately, you also may need USE INDEX clause, because query planner incorrectly chooses a less efficient index:

EXPLAIN ANALYZE
SELECT * FROM test USE INDEX(TEST_NAME_IDX) WHERE name = 'NAME9999' AND (type = 'TYPE1' OR type = 'TYPE2');
> SELECT
>     "PUBLIC"."TEST"."NAME",
>     "PUBLIC"."TEST"."TYPE"
> FROM "PUBLIC"."TEST" USE INDEX ("TEST_NAME_IDX")
>     /* PUBLIC.TEST_NAME_IDX: NAME = 'NAME9999' */
>     /* scanCount: 11 */
> WHERE ("TYPE" IN('TYPE1', 'TYPE2'))
>     AND ("NAME" = 'NAME9999')
ameyer3000 commented 1 year ago

That's certainly interesting! However for my use-case, it would be complicated/impossible to add these hints, as I need to support two different database engines at the same time - and that for many different types of queries with many other fields that may or may not use IN (list of literals). In my application, queries are not directly written in SQL, coming from application A that doesn't have knowledge about schema internals like indices etc, and another application B that has (among others) an H2-"backend" that generates SQL based on some parameters.

Any chance for a more generic fix, at least for field IN (list of literals)? Any pointers to the source-code, where this can be tweaked? Not sure if it's easy, in particular for someone without much DB background (like myself). Adding an index for each field is OK (that much control I do have over the schema), but then adding USE INDEX seems to be a workaround only for the case where at most one field uses IN (...) - as soon as there are two or more such expressions, this workaround doesn't seem to apply (does it? I didn't succeed adding multiple USE INDEX clauses).

katzyn commented 1 year ago

A query can use only one index.

WHERE column IN (list) doesn't need any quirks in H2, it only needs an index on (column) or (column, other_columns, …) with column first.

But in your case WHERE criteria specifies multiple columns. H2 cannot optimize all such queries well enough.

In your query there are two problems:

  1. column1 = something AND column2 IN (list) don't construct index condition with a list of value pairs for two-column index on (column1, column2). This feature can be introduced in H2, but it requires a good knowledge of H2's internals.
  2. Even without that feature H2 is still able to use index on column1 for column1 = something OR index on column2 for column2 IN(list) (indexes may have other columns, but these columns need to be specified first) and H2 needs to choose more efficient strategy. Currently H2 prefers less efficient index on (column2, column1) for column2 IN(list), but with your distribution of data this strategy is very inefficient. Without index on (column2, column1), but with individual indexes on (column1) and (column2) H2 correctly chooses an index on (column1). It means there is a problem with index cost calcutaion. The related code is in Index.getCostRangeIndex().
ameyer3000 commented 1 year ago

I think I found an acceptable workaround for me. After populating all my tables (with unique-constraint/PK), there will only be SELECTs. After dropping the PK as well as creating single-field indices for all PK fields for all tables, SELECTs don't require excessive scans even without hints like USE INDEX. This approach is "generic enough" for my purpose.

I had a quick look at Index.getCostRangeIndex but I guess I would have to invest a lot more time to be of real help there, unfortunately. Not sure if this issue should be closed (I have a workaround), or kept open (a more generic solution would still be nice where PK can be kept, no USE INDEX).

Anyhow, thanks a lot for your ideas!

sibsssidor commented 1 year ago

If I may jump aboard this discussion thread, looks like hibernate 6 changed the way it generates queries for embedded IDs from this form

select p1_0.x,p1_0.y from mytable p1_0 where p1_0.x = ‘mypieceofId1’ and p1_0.y= 1677042000000

to this one:

select p1_0.x,p1_0.y from mytable p1_0 where (p1_0.x,p1_0.y) in((‘mypieceofId1’, 1677042000000))

Unless I'm very much mistaken (if I am, pls tell me...) the second one does not seem to use any index (I tried both fields, in reveres, and individual ones). Pls take this as a personal preference, the second one is not how I would write a SQL query, but given how widespread hibernate is and given that embedded IDs are not uncommon, I am afraid this question will pop up more often. Related discussion on hibernate forum: https://discourse.hibernate.org/t/upgrade-to-hibernate-6-insert-performance-with-embedded-ids/7515

Any opinion/advice wcome :-)

katzyn commented 1 year ago

select p1_0.x,p1_0.y from mytable p1_0 where (p1_0.x,p1_0.y) in((‘mypieceofId1’, 1677042000000))

This is an entirely different already fixed problem, you can compile H2 from its current sources to test the fix.

sibsssidor commented 1 year ago

select p1_0.x,p1_0.y from mytable p1_0 where (p1_0.x,p1_0.y) in((‘mypieceofId1’, 1677042000000))

This is an entirely different already fixed problem, you can compile H2 from its current sources to test the fix.

that is great (and really appreciated)! let me try :-) !

sibsssidor commented 1 year ago

Built and tested, it picks up the index indeed. I will have to bring in a snaphot jar into my build but definitely small work compared to try get around the new hibernate behavior @katzyn tx for the help, really appreciated!

manticore-projects commented 9 months ago

Closing this as resolved, answered and stale. Please re-open when there are any unexplained answers left.