Hash opclass for spoint

postgrespro / pgsphere

PgSphere provides spherical data types, functions, operators, and indexing for PostgreSQL.

https://pgsphere.org

BSD 3-Clause "New" or "Revised" License

16 stars 14 forks source link

Hash opclass for spoint #102

Closed df7cb closed 10 months ago

df7cb commented 11 months ago

Besides hash indexes on spoint, this enables "select distinct spoint from tbl" queries.

Close #101.

df7cb commented 11 months ago

That's already in there:

pg_sphere--1.3.1--1.3.2.sql: pgs_circle_sel.sql.in pgs_hash.sql.in
    cat upgrade_scripts/$@.in $^ > $@

esabol commented 11 months ago

That's already in there:

pg_sphere--1.3.1--1.3.2.sql: pgs_circle_sel.sql.in pgs_hash.sql.in
  cat upgrade_scripts/$@.in $^ > $@

Oh, sorry, I was expecting to see pg_sphere--1.3.1--1.3.2.sql in the list of changed files in the commit and didn't look closely enough at the changes to the Makefile.

df7cb commented 11 months ago

The current implementation just xor-s the 4x4 bytes chunks together. I think that should yield a good distribution on the "random" coordinate values that should be seen in practise. If we think that's not good enough, we might just use the built-in hash functions on floats. (hashfloat8, hashfloat8extended)

vitcpp commented 10 months ago

I think, everythink is ok, but I would propose to improve the hash function. The longitude range is [0, 2*pi], the latitude range is [-pi/2,p/2], EPSILON = 1E-9. We have to calculate the hashes in these ranges. The smaller numbers are "equal" and should have the same hash. There are also some other special values like INF, NAN. I think that all NANs should produce the same hash value. Denormalized numbers are smaller than 1E-9, they are 0. I will think about some other implementation of hash function.

df7cb commented 10 months ago

I don't think we can do much about mapping values less than EPSILON apart to the same value - we don't know in which direction to round. The only thing we could do is map values close to 0 to 0, but if we don't do it for the rest, why bother with 0.

Infinity is not a problem since it's a unique bit value, but along with NaN, can these really appear in spoint values? I even had trouble getting -0 into an spoint - at one point I think I had one, but couldn't reproduce it later.

One possible hash implementation that takes care of -0 and NaN would be this:

PG_RETURN_INT32(hashfloat8(p1->lat + p1->lng));

PG_RETURN_INT32(hashfloat8(p1->lat) ^ hashfloat8(p1->lng));

vitcpp commented 10 months ago

@df7cb May be to use hashfloat8? I have no strong arguments against your implementation except that it is not the best hash function, I think. I want to ask my junior colleagues to think how to improve hash function and benchmark it.

df7cb commented 10 months ago

I used the hash1 ^ hash2 approach now, and added documentation.