Slow return when using q3c with EXIST

jrmullaney commented 5 years ago

Hi there,

We're trying to carry out a query that seems to be taking much longer than we'd expect. It seems to boil down to:

SELECT t1.objectId FROM table1 AS t1 WHERE EXISTS (SELECT 1 FROM table2 AS t2 WHERE q3c_join(t2.ra_query, t2.dec_query, t1.ra, t1.dec, 0.0008333));

To give an idea of numbers: in our tests there are around 30 rows in t2, and around 10^8 rows in t1; t1 is indexed on (ra,dec) with q3c, t2 is not indexed. There should be 30 matches.

Oddly, the subquery is very quick when ran separately, and if we manually enter the results from the separately-run subquery into the main query, then it is also very quick. However, when put together, they are very slow.

Also oddly, when we replace (t2.ra_query, t2.dec_query) with scalar values, the whole query runs very quickly.

We were wondering if there's something specific that will cause q3c_join to work slowly when combined with an EXISTS or IN statement (we've also tried the latter with the same results).

Thanks for your help, and for q3c in general!

segasai commented 5 years ago

Hi, 1) It'd help to see query plans (and EXPLAIN ANALYZE)

2) The way you wrote the query is that will be slow by design. It will have to do 30x1e8 distance calculations. because you are asking for each object in t1 to check if it is in in the t2 (table which doesn't have an index).

So depending on what you are doing, this query will work much better

SELECT distinct t1.objId FROM t2, t1 WHERE  q3c_join(t2.ra, t2.dec, t1.ra, t1.dec, 0.0008333);

I tested this on a simple testcase

create table t1 as select generate_series as objid, random()*360 as ra, random()*60-30 as dec from generate_series(0,10000000);
create table t2 as select random()*360 as ra, random()*60-30 as dec from generate_series(0,30);
create index  on t1(q3c_ang2ipix(ra,dec));
 create index  ON  t1(objid);

jrmullaney commented 5 years ago

Hi,

Thanks for the very quick response, and the suggestion, but I'm afraid it doesn't fix the problem we have.

I created the same tables you made in your simple testcase and I agree that the suggested query is very quick. Unfortunately, however, it's not suitable for the full query we want to make, in which we need to combine information from three tables (please see the end of this message for the full query we wish to make).

If, for example, I take the testcase and try doing:

SELECT t1.objid FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE q3c_join(t2.ra, t2.dec, t1.ra, t1.dec, 0.0008333));

Then this also takes a long time to complete (we've never run it to completion - we kill it before it completes).

To answer your point 1, EXPLAIN gives:

 Nested Loop Semi Join  (cost=0.00..37308531210.00 rows=1252364 width=4)
   Join Filter: ((q3c_sindist(t2.ra, t2."dec", t1.ra, t1."dec") < '5.2880737999157e-11'::double precision) AND (((q3c_ang2ipix(t1.ra, t1."dec") >= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 0)) AND (q3c_ang2ipix(t1.ra, t1."dec") <= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 1))) OR ((q3c_ang2ipix(t1.ra, t1."dec") >= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 2)) AND (q3c_ang2ipix(t1.ra, t1."dec") <= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 3))) OR ((q3c_ang2ipix(t1.ra, t1."dec") >= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 4)) AND (q3c_ang2ipix(t1.ra, t1."dec") <= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 5))) OR ((q3c_ang2ipix(t1.ra, t1."dec") >= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 6)) AND (q3c_ang2ipix(t1.ra, t1."dec") <= q3c_nearby_it(t2.ra, t2."dec", '0.0008333'::double precision, 7)))))
   ->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=20)
   ->  Materialize  (cost=0.00..37.75 rows=1850 width=16)
         ->  Seq Scan on t2  (cost=0.00..28.50 rows=1850 width=16)
(5 rows)

(I can't do EXPLAIN ANALYSE as it takes too long to complete)

Regarding point 2: I may have misunderstood, but I think I have the tables round the right way. In my original post, t1 is the big table with q3c indices, and t2 is the small one with no indices. This seems to be the same way round as the test case in your post. Did you mean when doing the EXISTS task, as opposed to q3c_join?

Thank you again for your help.

The full query (i.e., why we're doing this): We have three tables: u which contains unique objects, d which contains duplicate measurements of the objects in u, and q which contains the much smaller list of sources we'd like to find in the big tables. We first want to find all instances of q in u, then count how many measurements (in d) we have for each of these objects. Our full query looks something like:

SELECT u.objectId, COUNT(d) FROM u LEFT JOIN d ON u.objectId = d.objectId WHERE EXISTS (SELECT u.* FROM u, q WHERE q3c_join(q.ra_query, q.dec_query, u.ra, u.dec, 0.0008333)) GROUP BY u.objectId;

By contrast:

SELECT u.objectId, COUNT(d) FROM u LEFT JOIN d ON u.objectId = d.objectId WHERE EXISTS (SELECT * WHERE c.objectId>2651529464986470000 AND c.objectId<2651529464986471000) GROUP BY c.objectId;

works fine.

segasai commented 5 years ago

I argue that the way you structure your query is not quite right and is not optimal by construction.

Reading what you want in the end I've created test tables and a query that does what you want.

create table u1 as select generate_series as objid, random()*360 as ra, random()
*60-30 as dec from generate_series(0,1000000); -- uniques
create table d1 as select u1.* from u1, generate_series(0,10); -- data with repeats
create table q1 as select * from u1 order by random() limit 100; -- small table to check
create index  ON u1(q3c_ang2ipix(ra,dec));
create index  ON u1(objid);
create index  ON d1(objid);

This query does what you want and is very fast with x as (select q1.objid, u1.objid as uobjid from q1 left join u1 on q3c_join(q1.ra,q1.dec,u1.ra,u1.dec,1./3600)) select x.objid, count(*) from x, d1 where d1.objid=x.uobjid group by x.objid;

segasai / q3c

Slow return when using q3c with EXIST #18