prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Accelerate queries using sample tables in Presto #14902

Open ugurmeet opened 3 years ago

ugurmeet commented 3 years ago

The basic proposal is to accelerate queries by rewriting them to run against pre-computed sample tables.

The idea is to implement this as an optimization rule, so that the changes are contained in one place.
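A minimal sketch of what such a single rewrite rule could look like. This is not Presto's actual API; the node classes, the `SAMPLES` metadata map, and the table names are all hypothetical, and real Presto rules operate on `PlanNode` trees. The point is only the shape of the rule: swap the base-table scan for its sample and scale aggregates by the inverse sampling rate.

```python
# Hypothetical sketch of a sample-table rewrite rule (not Presto's real API).
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Aggregate:
    func: str            # e.g. "count" or "sum"
    child: object
    scale: float = 1.0   # multiplier to apply to the aggregate's result

# Assumed metadata: base table -> (sample table, sampling rate).
SAMPLES = {"trips": ("trips_sample_1pct", 0.01)}

def rewrite(node):
    """Return (rewritten node, sampling rate observed below this node)."""
    if isinstance(node, Scan) and node.table in SAMPLES:
        sample_table, rate = SAMPLES[node.table]
        return Scan(sample_table), rate
    if isinstance(node, Aggregate):
        child, rate = rewrite(node.child)
        # Scale the aggregate to compensate for the smaller input.
        return Aggregate(node.func, child, scale=1.0 / rate), 1.0
    return node, 1.0
```

With this sketch, `rewrite(Aggregate("count", Scan("trips")))` scans `trips_sample_1pct` and scales the count by 100.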

mbasmanova commented 3 years ago

@ugurmeet This can be implemented using com.facebook.presto.spi.ConnectorPlanOptimizer. Someone tried to build this kind of system at FB, but it turned out to be very hard. E.g. simple queries work, but "real" queries are way too complex. For example, if you have a join you need to carefully sample both sides to avoid getting nothing.
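The join pitfall is easy to see with a toy simulation (made-up data): sampling each side of a key join independently at rate p keeps only about p*p of the matching rows, not p. One standard remedy, assumed here rather than taken from Presto, is to sample both sides on the same deterministic hash of the join key, so a key is either kept on both sides or dropped on both.

```python
import hashlib
import random

random.seed(42)
n, p = 100_000, 0.1
keys = range(n)

# Independent uniform samples of each join side: only ~p*p of matches survive.
left = {k for k in keys if random.random() < p}
right = {k for k in keys if random.random() < p}
survived = len(left & right) / n          # ~0.01, not 0.1

# Remedy: sample on a deterministic hash of the join key.
def keep(k):
    return int(hashlib.md5(str(k).encode()).hexdigest(), 16) % 100 < 10

left_h = {k for k in keys if keep(k)}     # same keys kept on both sides
right_h = {k for k in keys if keep(k)}
survived_h = len(left_h & right_h) / n    # ~0.1, the full sampling rate
```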

gabijs commented 3 years ago

@ugurmeet: what is the required metadata information about sample tables? The correct transformation on aggregations may depend on the sampling method used to obtain the sample table and the query shape. There is also the issue raised by @mbasmanova: one needs sufficient metadata to apply correct transformation for joins.

himanshpal commented 3 years ago

@mbasmanova - How does sampling work currently at FB? Sampled tables would need to work with multiple engines (Presto, Spark, Flink, etc.), so I am guessing this would happen at the proxy/gateway layer.

mbasmanova commented 3 years ago

I don't believe we have a sampling solution. CC: @gabijs

himanshpal commented 3 years ago

I do remember watching this sometime earlier this year & it was really informative - https://www.facebook.com/watch/?v=322427055210984

gabijs commented 3 years ago

@HimanshPal : Whatever transformation we do in sampling is codified outside of the query engine precisely because there are subtleties that are not ready to be integrated directly as part of the engine. Although sampling metadata belongs to our metadata service that Presto accesses, the information is leveraged by this external component and Presto only sees the already transformed query.

ugurmeet commented 3 years ago

Hi @mbasmanova, @gabijs, here is my proposal for using sampled tables in Presto. Hopefully it answers some of your questions. Please feel free to comment. I would like to sync up before going further down this path. Thanks.

mbasmanova commented 3 years ago

@ugurmeet Thanks for sharing the design. Looks interesting. I don't understand the following statement:

Filter nodes can prune map if the filtering is happening based on sampling
column e.g. if a table is sampled on trip_id and query is ‘select …. from table
where trip_id = xxx’ then there is no point in scaling aggregations in the query

In this case I think sampling shouldn't be used, because you may either get accurate result or completely wrong result (e.g. 0).
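Toy numbers (a hypothetical `trips` table sampled at 1% on `trip_id`) make the point concrete: a global aggregate scales sensibly, but a point filter on the sampling key does not, so the rewrite should not apply there.

```python
import random

random.seed(7)
trips = list(range(10_000))   # true answer for "where trip_id = 1234" is 1 row
rate = 0.01
sample = [t for t in trips if random.random() < rate]

# A global aggregate scales sensibly: count(sample) / rate estimates the total.
estimated_total = len(sample) / rate      # close to the true 10_000

# A point filter on the sampling key does not: the row is either in the
# sample (count 1; scaling it to 100 is wrong) or absent (count 0).
# No scale factor recovers the true answer of 1.
hits = sum(1 for t in sample if t == 1234)
```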

ugurmeet commented 3 years ago

@mbasmanova, you are right. It's very likely the results will be 0 in that case, so it is better not to use sampling (error out). These are the kinds of rules that I want to flesh out. Thanks.

mbasmanova commented 3 years ago

@ugurmeet I'm also wondering whether it is a good idea to fail queries if sampling cannot be applied. I think it is better to just not use sampling in these cases, but allow the query to proceed. E.g. opt-in flag means "feel free to use sampled tables if you can give me approximate results faster".
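The opt-in, never-fail behavior could be sketched like this (all names are hypothetical, not Presto session properties): when the rewrite cannot prove the query is safe to sample, it returns the original plan instead of failing.

```python
def try_sample_rewrite(plan):
    # Placeholder: a real rule would walk the plan and bail out (return None)
    # on any unsupported construct, e.g. a filter on the sampling key.
    return None

def optimize(plan, session):
    if not session.get("use_sampled_tables", False):
        return plan                          # sampling is strictly opt-in
    rewritten = try_sample_rewrite(plan)     # None => rewrite not applicable
    return rewritten if rewritten is not None else plan   # proceed unsampled
```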

ugurmeet commented 3 years ago

@mbasmanova I was thinking the failure would make it clear to users whether sampling worked for them. Also, it would allow developers to see how many queries fail due to unsupported scaling and examine those cases more closely. But I guess this information (whether the query was successfully sampled and scaled) can also be captured in the query logs.

mbasmanova commented 3 years ago

Indeed, runtime information about sampling performance can go into logs. In my experience, users don't like query failures and prefer reliable execution over fast execution. Many would be reluctant to use sampling if it may fail. Using sampling in pipelines that run in the background would be practically impossible.

ugurmeet commented 3 years ago

Sounds good. That would save a retry from the higher layers.