Unnecessary fallback to table scan

GoogleCodeExporter commented 9 years ago

When performing explain on the query

  select o1.ID
  from OBJECT1 o1
  where o1.COL1 = 447
    and o1.refID not in (select o2.ID from Object2 o2)

then I get the plan:

  SELECT
      O1.ID
  FROM PUBLIC.OBJECT1 O1
      /* PUBLIC.OBJECT1: COL1 = 422 */
  WHERE (O1.COL1 = 422)
      AND (NOT (O1.A_REFID IN(
      SELECT DISTINCT
          O2.ID
      FROM PUBLIC.OBJECT2 O2
          /* PUBLIC.OBJECT2.tableScan */)))

select count(*) from OBJECT1 : 1224
select count(*) from OBJECT1 where COL1=422 : 1059
select count(*) from OBJECT2 : 41474

I have left the query to run for an an hour and it did not return. Object2 is a 
very "fat" row, which does not help.

However, when I change the query to the following:

  select o1.ID
  from OBJECT1 o1
  where o1.COL1 = 447
    and o1.refID not in (select o2.ID from Object2 o2 order by ID)

then suddenly we get a much different plan, one that can even be analyzed:

  SELECT
      O1.ID
  FROM PUBLIC.OBJECT1 O1
      /* PUBLIC.OBJECT1: COL1 = 422 */
      /* scanCount: 1 */
  WHERE (O1.COL1 = 422)
      AND (NOT (O1.A_REFID IN(
      SELECT DISTINCT
          O2.ID
      FROM PUBLIC.OBJECT2 O2
          /* PUBLIC.PRIMARY_KEY_7258 */
      ORDER BY 1
      /* index sorted */)))
  /*
  total: 2
  JCS_OBJPRIVGRANT0.JCS_OBJPRIVGRANT0A read: 2 (100%)
  */

That is, by making the query more complex it makes it dramatically faster. I 
would not expect that adding the "order by" would make any difference in the 
query plan, however, the effect is truly dramatic.

What version of the product are you using? On what operating system, file 
system, and virtual machine?

I am running this against 1.3.153, on MacOSX.

Do you know a workaround?

Add order by to the subselect.

What is your use case, meaning why do you need this feature?

In our product this query is used to find orphaned objects. These can happen 
quite normally, and so need to be cleaned up.

How important/urgent is the problem for you?

If the tables get large enough, then the system becomes unusable, as this query 
blocks the entire system from working. I guess that if we turn on 
MULTI_THREADED we can work around that, however we still have the issue that a 
maintenance task that should only take a few seconds takes hours to complete.

Please provide any additional information below.

I can't provide the original data, but would be happy to test any fixes against 
my database.

Original issue reported on code.google.com by pwagl...@gmail.com on 17 Mar 2011 at 8:31

GoogleCodeExporter commented 9 years ago

You can use the following script to recreate the almost the same situation that 
I see. In particular, it has the spurious table scan. Because of the @LOOP 
construct, this needs to be done in the H2 console.

create table OBJECT1 ( ID BIGINT NOT NULL, REFID BIGINT NOT NULL, COL1 BIGINT 
NOT NULL, PRIMARY KEY ( ID ) );
create INDEX OBJECT10 on OBJECT1 ( REFID );
create INDEX OBJECT11 on OBJECT1 ( COL1 );
create table OBJECT2 ( ID BIGINT NOT NULL, PRIMARY KEY ( ID ) );

@LOOP 41474 insert into Object2 values (?);
@LOOP 1059 insert into Object1 values (?,?,422);
@LOOP 165 insert into Object1 values (?+1059,?,423);

explain select o1.ID
  from OBJECT1 o1
  where o1.COL1 = 422
    and o1.refID not in (select o2.ID from Object2 o2);

explain select o1.ID
  from OBJECT1 o1
  where o1.COL1 = 422
    and o1.refID not in (select o2.ID from Object2 o2 order by o2.ID);

drop table OBJECT1;
drop table OBJECT2;

Original comment by pwagl...@gmail.com on 17 Mar 2011 at 9:00

GoogleCodeExporter commented 9 years ago

With a bit of experimentation, then what is also interesting is that if 10000 
elements are inserted into Object2, then everything works just dandy, in both 
cases. It still does a table scan instead of an index scan, but it returns in a 
smallish number of ms.

However, go from 
  @LOOP 10000 insert into Object2 values (?);
to
  @LOOP 10001 insert into Object2 values (?);

the the analyze time goes from 26ms to 56823 ms. Talk about going off of a 
cliff!

Often when running with 10001 I also get other errors, out of memory and 
"unexpected code path" errors.

Original comment by pwagl...@gmail.com on 17 Mar 2011 at 9:21

GoogleCodeExporter commented 9 years ago

Last comment for the day…

The "unexpected code path" errors only turned up for me when doing "explain 
analyze", a plain explain did not cause them.

Increasing the amount of heap from default made the OOM exceptions go away 
(surprising? not really ;-)

Original comment by pwagl...@gmail.com on 17 Mar 2011 at 9:36

GoogleCodeExporter commented 9 years ago

Unfortunately, this problem is a little more serious for me… the SQL parser 
of the product that I am using does not allow the subselect to have an order by 
clause… So I have no woraround available.

Original comment by pwagl...@gmail.com on 18 Mar 2011 at 9:30

GoogleCodeExporter commented 9 years ago

Another data point:

Given the two (functionally equivalent) queries:

explain select o1.ID
  from OBJECT1 o1
  where o1.COL1 = 422
    and o1.refID not in (select o2.ID from Object2 o2);

explain select o1.ID
  from OBJECT1 o1
  where o1.COL1 = 422
    and not exists (select 1 from Object2 o2 where o1.refID = o2.ID);

The first generates a plan that does a table scan, and the second generates a 
query that uses the index.

Original comment by pwagl...@gmail.com on 19 Mar 2011 at 11:07

GoogleCodeExporter commented 9 years ago

OK. A bit more playing around, and it would appear the "root cause" of the 
problem is that it is not recognising the tranformation from:

  XX.F1 in (select YY.F2 from YY)

as being equivalent to

  exists (select 1 from YY where YY.F2 = XX.F1)

this causes two issues:
 1. A sub optimal query is produced, even if it were to use the index, it still does an index scan, and looks to see if the value is in that list, as opposed to doing a lookup.
 2. When the YY table gets over 10000 elements in size, the query has a very high chance of simply not working.

Original comment by pwagl...@gmail.com on 19 Mar 2011 at 11:33

GoogleCodeExporter commented 9 years ago

The unexpected code path exception has been spit off to issue 304.

Original comment by pwagl...@gmail.com on 19 Mar 2011 at 11:57

GoogleCodeExporter commented 9 years ago

Please note that the original hypothesis in this ticket was wrong. Adding the 
order by doesn't help, it actually makes the whole situation worse. The real 
problem is that identified in comment 6… I would change the ticket title, but 
do not seem to be able to do this.

Original comment by pwagl...@gmail.com on 20 Mar 2011 at 1:11

GoogleCodeExporter commented 9 years ago

Hi,

The problem with "IN(SELECT ... ORDER BY)" has been fixed.

It is true that "x NOT IN(SELECT ...)" is not converted to "NOT EXISTS(...)", 
because that's a quite tricky conversion. I will add a feature request. 
However, it currently doesn't have a very high priority for me, sorry.

Original comment by thomas.t...@gmail.com on 5 Apr 2011 at 5:16

Changed state: InRoadmap

GoogleCodeExporter commented 9 years ago

If you can give some hints as to where to start looking to do this, or even 
better a similar transformation that is done elsewhere, I can have a look at 
this, since this is important to me... I am happy to scratch that itch, but I 
need someplace to start poking.

Original comment by pwagl...@gmail.com on 5 Apr 2011 at 1:42

GoogleCodeExporter commented 9 years ago

> If you can give some hints as to where to start looking to do this

Hm, I'm afraid that's not so easy... Maybe ConditionNot / ConditionIn / 
ConditionExists optimize(), but it's quite complicated and easy to get wrong...

Original comment by thomas.t...@gmail.com on 9 Apr 2011 at 1:07

GoogleCodeExporter commented 9 years ago

I have a very early preliminary patch… just to make sure that I am on the 
road that you would like taken…

This works for my test cases, however I am pretty confident that it won't work 
for everything. I propose to keep the patch more or less as is, but add a few 
more conditions to the query so that the optimization is only done when we are 
sure that it is safe to do. So for example, to only do the remapping when there 
is no group by clause.

Does that make sense to you?

Original comment by pwagl...@gmail.com on 17 Apr 2011 at 9:26

Attachments:

optimize_in_select_to_exists.patch

GoogleCodeExporter commented 9 years ago

Hi,

I'm sorry, but I will not apply the patch. It's too much magic for me.

Regards,
Thomas

Original comment by thomas.t...@gmail.com on 11 Jun 2011 at 3:52

GoogleCodeExporter commented 9 years ago

I am more than happy to say that there is too much magic involved… how would 
you like to see it modified? Where should this be hooked in?

Original comment by pwagl...@gmail.com on 30 Jun 2011 at 10:31

GoogleCodeExporter commented 9 years ago

I don't really know yet... One problem is that cloning an object in this way is 
problematic for maintenance (when adding a new field, it's very easy to forget 
adding that to the clone constructor). Maybe the subquery could be cloned by 
creating the SQL statement, and then parsing it?

Original comment by thomas.t...@gmail.com on 22 Oct 2011 at 9:05

google-code-export / h2database

Unnecessary fallback to table scan #303