timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com
Other
365 stars 46 forks source link

Approximate member query #11

Open WireBaron opened 3 years ago

WireBaron commented 3 years ago

What's the functionality you would like to add

We'd like to provide an approximate member query, such as quotient filter, that can quickly answer if an element likely belongs to a set. This could then be used a filter to quickly rule out the existence of value in a database (or some range of a database) without performing a more expensive query.

How would the function be used

The implementation would provide a postgres aggregate which would compute the filter over some column of a table.

Consider a table listing real estate transactions, we might create a filter over the house location as follows:

# \d home_sales
                       Table "public.home_sales"
  Column   |           Type           | Collation | Nullable | Default
-----------+--------------------------+-----------+----------+---------
 sale_date | timestamp with time zone |           |          |
 address   | text                     |           |          |
 seller    | text                     |           |          |
 buyer     | text                     |           |          |
 price     | double precision         |           |          |

# CREATE VIEW address_filter AS SELECT quotient_filter(address) FROM home_sales;
CREATE VIEW

# SELECT prob_has_member(address_filter, '1342 Palmdale Ln');
 prob_has_member
-----------------
 f
(1 row)

These filters should also be combinable, allowing efficient queries over distinct subsets (or partial aggregations in the case of continuous aggregation).

# SELECT prob_has_member(combine_filters(address_filter_2019, address_filter_2020), '1342 Palmdale Ln');
 prob_has_member
-----------------
 t
(1 row)

Why should this feature be added?

Postgres does not provide an efficient mechanism to test for the existence of a particular value short of actually looking up the row.

What scale is this useful at?

This will be more useful for large datasets. In order for this to be useful the cost to lookup a particular row by the filtered value will have to be significant. This will also be most useful with a workflow that expects a high rate of negative checks (i.e. where most values checked are not present in the table).

Drawbacks

Quotient filters can be fairly large. In order to maintain efficient lookups, it may be necessary to store anywhere from 1.3 to 2.5 hashes per value (perhaps more depending on the implementation).

Open Questions

How should we determine how large to make the filter?
The false positive rate is directly related to both the size of the filter and the size of the table. We should be able to allow the user to specify a maximum tolerable false positive rate and size/resize the filter based upon that, but this could result in an unacceptably large table if set too aggresively on too large of a table. Alternatively we could have the user specify an allowed size and leave them to deal with the resulting false positive rate. It should be fairly straightforward to provide functionality to resize the table (by factors of 2), so it's not critical that this is set correctly initially. If the resizing is cheap enough, we'll also likely want to store smaller filters for partial aggregates.

Alternatives

Bloom filters also provide this functionality, but using a quotient filter based approach provides the following benefits over building an implementation on top of a bloom filter:

  1. Quotient filters can be efficiently merged without affecting their false positive rate.
  2. Quotient filters handle hash collisions more efficiently than bloom filters
  3. At false positive target rates less than 1/64, quotient filters require less space than bloom filters

Another approximate member query is the Cuckoo filter (https://en.wikipedia.org/wiki/Cuckoo_filter) which looks like it may be even more space efficient than the quotient filter. However, it look like this filter drops off dramatically in terms of lookup performance and space efficiency at lower rates of false positives.

davidkohn88 commented 3 years ago

Discussion in #34