Closed GoogleCodeExporter closed 9 years ago
I should have noted that we created an index on q3c_ang2ipix(ra,dec) for the
tables after the table was created and then clustered on that index.
Original comment by taq...@gmail.com
on 9 Oct 2012 at 6:53
Hi,
Thanks for the report.
I think I know what's the issue.
The main problem here is that handling of large angles is indeed a bit
problematic in q3c. q3c_radial_query for example does switch to the whole
table scan in the case of very large angles. And q3c_join actually may not
have enough protection against big angles at all.
Another problem is that if you have the table with ra,dec and uncertainties
(or search radius), and uncertainty can be very large, then you pretty much
have to sequentially scan the whole table in order to find out whether the
uncertainty circle covers a given point or not (unless the index also
includes the uncertainty).
So, I'll take a look at q3c_join to prevent the crash.
In the meantime you can just use q3c_dist(99,18,ra,dec)<sr instead -- it
will be the exactly the same (because in the way you are using q3c_join, the
index won't be used anyway).
Could you also show me how did you use q3c_radial_query ?
Also, one possible recommendation for dealing with tables with large
uncertainies
would be (assuming that the fraction of objects with really big uncertainties is
small) is the following:
## Create the test dataset with uncertainties from 0 to 40deg with 10%
## >10deg
CREATE TABLE testbig as select random()*360 as ra ,random()*180-90 as
dec,random()**2*40 as sr from generate_series(0,33000000);
CREATE TABLE testbig_sorted as select * from testbig order by
(sr<10)::int,q3c_ang2ipix(ra,dec);
## sort the data in such a way that large uncertainty objects are clustered
## Create the 33mil test dataset with uncertainties from 0 to 40deg, 10% of
them
## >10deg
CREATE TABLE testbig as select random()*360 as ra ,random()*180-90 as
dec,random()**2*40 as sr from generate_series(0,33000000);
CREATE TABLE testbig_sorted as select * from testbig order by
(sr<10)::int,q3c_ang2ipix(ra,dec);
## sort the data in such a way that large uncertainty objects are clustered
## create indices
create index on testbig_sorted(sr);
create index on testbig_sorted(q3c_ang2ipix(ra,dec));
# select objects covering a given point ra=180 dec=30
select * from testbig_sorted where
(q3c_join(180, 30, ra, dec, 10) and sr< 10 and q3c_dist(ra, dec, 180, 30) < sr ) OR
( sr > 10 and q3c_dist(ra, dec, 180, 30) < sr);
# it takes <~4 seconds on my machine.
The idea is that you group objects which have large uncertainty -- those you
better check one by one, and for the rest do the normal cone search. I used
10degrees as a cutoff, but it could be different depending on the distribution
of uncertainties.
Sergey
Original comment by koposov
on 9 Oct 2012 at 8:14
https://code.google.com/p/q3c/source/detail?r=88a448a1a9bd27cddc8172f20671ecae08
ebf3c6
I've committed the q3c_join() fix. But I didn't have time yet to check it.
Original comment by koposov
on 9 Oct 2012 at 8:38
Okay, the problem was deeper than I expected, but i've hopefully fixed it in
the following commit:
http://code.google.com/p/q3c/source/detail?r=e42285cdc04f96b73cee635fdd0160a36ee
4c7cf
I'll probably release the 1.4.14 version soon.
Regarding the queries in my comment #2 (I mistyped, I meant random()^10*40
instead of random()**2*40 )
Original comment by koposov
on 10 Oct 2012 at 3:21
Hi Sergey,
Thanks for the quick response. We'd already adopted the approach you
suggested below and separated our data into two tables pos_big and
pos_small where the size/errors for pos_big were > 1 degree while pos_small
had all of the rows with size/errors <= 1 degree. Only about 1% of the
data has large errors, so pos_big is (contrary to its name!) a much smaller
table than pos_small. As you point out, regardless of the approach used,
we'll need to essentially do a full table scan of the pos_big table, but it
is small enough that that is tolerable. While we could put all of these
into a single table (as you do below) and use multiple indexes on the table
when we were looking at doing this using other schemes it seemed easiest to
split it into distinct tables.
Operationally it would be convenient if I could use the same syntax in
querying both tables (i.e., have a standard syntax for doing cone searches)
and I'd certainly be a bit leery of using a library that can crash for a
perfectly valid (if not especially efficient) query. I don't mind if q3c
is inefficient in doing this large angle query -- just that it doesn't
fail. We also allow users to enter SQL directly, so I'd also prefer not to
have a known vector for malicious users to easily crash our system.
If you'd like to see what we're doing, I've got a little blog post up at
http://skyview.gsfc.nasa.gov/xaminblog/index.php/2012/10/01/positional-matching-
in-postgres/
that describes what I'm doing and various approaches we've looked at. When
I first started with Q3C I did both correlations and cone searches using
q3c_join, where I found that cone searches were efficient when I used
q3c_join(187,2,ra,dec,1)
but slow with
q3c_join(ra,dec,187,2,1)
I then reread the documentation and saw that you recommended
q3c_radial_query for cone searches. and tried it, but it was much slower
than q3c_join. I tried it in both orientations
q3c_radial_query(ra,dec,187,2,1) and q3c_radial_query(187,2,ra,dec,1)
even though I think the documentation strongly suggests the first. However
I didn't investigate this too thoroughly since q3c_join worked just fine
(until I ran into the bug).
A typical cone search query that we're trying looks like:
select table_name,count(*) from pos_small
where q3c_join(187,2,ra,dec,dsr)='t' and
q3c_join(187,2,ra,dec,1)='t'
group by table_name
union
select table_name,count(*) from pos_big
where q3c_join(187,2,ra,dec,dsr)='t'
group by table_name
order by 1;
The ='t' is superfluous but is needed by our SQL security checker scan
which doesn't know that functions can have logical values. It doesn't seem
to affect indexing (or I'd dig into our JSQLParser implementation to add
logical functions).
I've tried putting the ra,dec as the first arguments in q3c_join and that
is much slower.
Using q3c_radial_query instead of q3c_join seems to be much slower in both
orientations.
The second q3c_join in the first element of the union matches what you had
suggested below. It doesn't look like the Q3C uses the index efficiently
if the search radius is a field. This seems common to all of the indexing
approaches we tried (see the blog article).
In any case thanks for responding so quickly. I'll put in your patches
today and try things out and let you know if it works.
Tom McGlynn
Original comment by taq...@gmail.com
on 10 Oct 2012 at 12:01
Hi Tom,
First, regarding the order of arguments issue, It is very important for the
index usage indeed, so for example if instead of
q3c_radial_query(ra,dec,180,30,4) you'll write
q3c_radial_query(180,30,ra,dec,4) that'll guarantee that the index won't be
used. I thought that the documentation was clear enough about that, but
apparently it wasn't.
The same applies to q3c_join.
Regarding the slowiness of q3c_radial_query vs q3c_join, it is to some
degree a known issue and is by design, but maybe it's time to revisit.
Because q3c_radial_query is optimized for the large catalogues, where you
really want to do the smallest amount of I/O possible. So it is much more
CPU heavy, but it is very I/O efficient: it really tries to approximate the
query circle with the quad-tree squares and retrieve from the disk the
smallest number of blocks. So if you'll try to compare q3c_radial_query
vs q3c_join for very large catalogs (e.g. SDSS like or 2mass) q3c_radial_query
will in general outperform q3c_join. But when all the data small and/or is
fully cached q3c_join will win due to smaller CPU overhead.
Regarding the stability of q3c, I have been using it actively for ~ 7 years
and I know a lot of other people does. I have probably tens of billions of
sources in total indexed and queried by q3c. So I'm pretty sure that it is very
stable in the standard use case of not too large search radii <~ 20-30deg.
(that what the regime q3c was designed for and used). Regarding the very
large queries, I'm pretty sure that cone and ellipse searches with very big
radii should be fine as well. Very large polygon queries are disabled (e.g
with diameters >~ 40-50 degrees). Regarding the crossmatches with very large
radii, I've put protection against very large radii just now, so I think it
should be fine as well, but obviously I can't be 100% sure that it's rock
stable. I also know that polygon queries are used much less than other
kind of queries, so the likelihood of bugs in that part of the codebase is
higher.
Thanks for the blog post. It is very helpful. I may write some more
comments later. I can say that I'm using PGsphere sometimes when I have to
work with
image/survey footprints and their overlaps, or when I have to find out which
images cover a given point on the sky. In those cases PGSphere is definitely
very good (and especially in those cases usually you don't have large number
of objects/images, so the slowness of R-trees doensn't manifest itself.
PS I'll be happy to discuss anything related to q3c or in general PG spatial
queries by email koposov@ast.cam.ac.uk (it's a bit of pain to use
code.google.com form for that), because I'm very interested in the subject and
happy to hear and fix any possible deficiencies of q3c.
Original comment by koposov
on 10 Oct 2012 at 2:43
Hi Tom,
I've just released the new version of q3c, which should perform much better
with q3c_radial_query() queries, so I would suggest to try it instead of
q3c_join, if you have time.
It also includes some bug fixes with regards to polygon queries, so it's
probably a good idea to update anyway.
Regards,
Sergey
Original comment by koposov
on 17 Dec 2012 at 6:12
Don't use this account much so I just saw this. Thanks for the update and
we've grabbed the lastest version.
Tom
Original comment by taq...@gmail.com
on 9 Feb 2013 at 5:15
Original issue reported on code.google.com by
taq...@gmail.com
on 9 Oct 2012 at 6:50