prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Aria Scan Optimizations #12585

Closed mbasmanova closed 4 years ago

mbasmanova commented 5 years ago

Query performance is often dominated by table scan. Hence, improving scan efficiency is important. We propose to change the scan implementation to avoid doing extra work and producing extra data: https://code.fb.com/data-infrastructure/aria-presto/

The implementation consists of 3 major parts:

public interface ColumnHandle
{
    /**
     * Applies to columns of complex types: arrays, maps and structs. When a query
     * uses only some of the subfields, the engine provides the complete list of
     * required subfields and the connector is free to prune the rest.
     * <p>
     * Examples:
     *  - SELECT a[1], b['x'], x.y.z FROM t
     *  - SELECT a FROM t WHERE b['y'] > 10
     * <p>
     * Pruning must preserve the type of the values and support unmodified access.
     * <p>
     * - Pruning a struct means populating some of the members with null values.
     * - Pruning a map means dropping keys not listed in the required subfields.
     * - Pruning arrays means dropping values with indices larger than maximum
     * required index and filling in remaining non-required indices with nulls.
     */
    @Experimental
    default ColumnHandle withRequiredSubfields(List<Subfield> subfields)
    {
        return this;
    }
}

A prototype of the above functionality is currently available in the aria-scan-research branch.
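For illustration, a connector-side handle might record the pruned subfields roughly as in the sketch below. ExampleColumnHandle and its fields are hypothetical, not the prototype's actual classes; the sketch assumes the SPI types ColumnHandle and Subfield shown above are on the classpath.

```java
import java.util.List;

// Hypothetical connector column handle; class name and fields are illustrative only.
public class ExampleColumnHandle
        implements ColumnHandle
{
    private final String columnName;
    private final List<Subfield> requiredSubfields;

    public ExampleColumnHandle(String columnName, List<Subfield> requiredSubfields)
    {
        this.columnName = columnName;
        this.requiredSubfields = requiredSubfields;
    }

    @Override
    public ColumnHandle withRequiredSubfields(List<Subfield> subfields)
    {
        // Remember which subfield paths the query actually references, e.g. a[1], b['x'], x.y.z.
        // The reader can then skip or null out everything else while preserving the column type.
        return new ExampleColumnHandle(columnName, subfields);
    }

    public List<Subfield> getRequiredSubfields()
    {
        return requiredSubfields;
    }
}
```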

CC: @oerling @tdcmeehan @elonazoulay @yingsu00 @nezihyigitbasi @arhimondr @zhenxiao @bhhari @sayhar

arhimondr commented 5 years ago

Thanks for sharing this @mbasmanova! This looks like a great set of improvements.

Evaluate simple filters (TupleDomain) directly on encoded (ORC) data without producing a Block with all the values first.

Do we consider replacing the current LazyBlock implementation (which is "lazy" at the block level) with something like a ReallyLazyBlock that is "lazy" on a per-value basis? With ReallyLazyBlock the filters can still be evaluated by the engine (even complex disjuncts like x LIKE '%blah1%' OR x LIKE '%blah2%' OR someveryheavyfunction(x) = 'blah3') in a compiled way.
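For concreteness, a per-position lazy block could look roughly like the sketch below. ReallyLazyBlock is only the name proposed above, and the PositionLoader callback is invented here for illustration; neither exists in Presto.

```java
// Illustrative only: a long-typed block that materializes individual positions on demand.
public class ReallyLazyLongBlock
{
    // Hypothetical callback that decodes a single value straight from the encoded (ORC) data.
    public interface PositionLoader
    {
        long loadLong(int position);
    }

    private final PositionLoader loader;
    private final long[] values;
    private final boolean[] loaded;

    public ReallyLazyLongBlock(PositionLoader loader, int positionCount)
    {
        this.loader = loader;
        this.values = new long[positionCount];
        this.loaded = new boolean[positionCount];
    }

    public long getLong(int position)
    {
        // Only the positions a filter (or a downstream operator) actually touches get decoded.
        if (!loaded[position]) {
            values[position] = loader.loadLong(position);
            loaded[position] = true;
        }
        return values[position];
    }
}
```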

Monitor performance of filters (simple and complex) and dynamically change the order in which columns are read to evaluate most efficient filters first

This sounds very cool! It may give us a nice performance boost for filter expressions like comment LIKE '%something%' AND userid = 12345. I would presume that mistakes like that (putting the expensive filter first) are common in many ad hoc queries people write.

Could you please elaborate a little bit more on that? Are you going to implement it in the engine itself? It may be beneficial for all the connectors including Raptor (CC: @hellium01 @highker ).

Put a cap on the amount of data read into a single page by monitoring filter selectivity and average row size and adjusting number of rows to read into next page.

I think this is already done in some sense in PageProcessor. Could you please elaborate a little bit more on what improvements you would like to introduce?

mbasmanova commented 5 years ago

@arhimondr The plan is to implement all of these improvements inside the Hive connector, i.e. they will only apply to the ORC (and DWRF) readers. We hope that someone from the community will also implement similar optimizations for Parquet. The engine changes will be in 3 areas: (1) optimizer rule to extract subfield paths used in the query; (2) SPI changes to pushdown filters and referenced subfield paths; (3) a set of helper functions to compile filters and extract TupleDomain conjuncts.

Re: cap - LazyBlock doesn't provide any cap on the total memory needed to load it. In the presence of very large rows, loading a single LazyBlock holding 10K rows may exceed memory limits. The new logic will adapt the number of rows being read based on observed row size and filter selectivity. Each column reader will then be provided with the number of rows to read and the maximum memory to use. If the memory limit is reached before the requested number of rows has been read, a retry with a dramatically reduced number of rows will occur. The number of rows will increase gradually afterwards (if the data allows).
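As a rough illustration of that adaptive logic (a sketch with assumed names and constants, not the actual reader code):

```java
// Sketch of adaptive batch sizing: shrink sharply when the memory cap is hit,
// then grow back gradually based on the observed bytes per row.
public class BatchSizeController
{
    private static final int MIN_BATCH_SIZE = 1;
    private static final int MAX_BATCH_SIZE = 10_000;

    private final long maxBatchBytes;
    private int batchSize = MAX_BATCH_SIZE;

    public BatchSizeController(long maxBatchBytes)
    {
        this.maxBatchBytes = maxBatchBytes;
    }

    public int nextBatchSize()
    {
        return batchSize;
    }

    // Called after each batch with the bytes actually produced and whether the cap was exceeded.
    public void update(long observedBytes, int rowsRead, boolean memoryLimitReached)
    {
        if (memoryLimitReached) {
            // Retry with a dramatically reduced number of rows.
            batchSize = Math.max(MIN_BATCH_SIZE, batchSize / 10);
            return;
        }
        long bytesPerRow = Math.max(1, observedBytes / Math.max(1, rowsRead));
        long target = maxBatchBytes / bytesPerRow;
        // Grow gradually toward the target if the data allows it.
        batchSize = (int) Math.min(MAX_BATCH_SIZE, Math.min(target, batchSize * 2L));
    }
}
```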

arhimondr commented 5 years ago

(1) optimizer rule to extract subfield paths used in the query;

(screenshot: Selection_059)

(2) SPI changes to pushdown filters and referenced subfield paths;

(screenshot: Selection_059)

(3) a set of helper functions to compile filters and extract TupleDomain conjuncts.

That requires some discussions.

Currently Presto has a clear level of abstraction between the engine and the data source. The responsibility of the data source is only to provide the data to the engine. The responsibility of the engine is to process the data (apply filters, aggregations, joins). Moving filter evaluation to the connector degrades this clear separation, as it effectively moves part of the execution into the connector itself.

Also, the compiler framework Presto has is a very thoroughly designed and engineered piece of software. We should think very hard before deciding to go and re-implement that functionality in the Hive connector. We don't want to lose our current optimizations in favour of new ones. Run-time filter reordering is a very promising optimization, but while implementing it we don't want to lose the very efficient compilation of complicated expression trees that we have right now.

a set of helper functions

If this can be done with a set of helper functions, it implies some level of abstraction. Instead of reinventing a new level of abstraction, does it make sense to first identify what is wrong with the abstraction we have so far? If there is something wrong, is there a way to fix it?

Re: cap - LazyBlock doesn't provide any cap on the total memory needed to load it. In the presence of very large rows, loading a single LazyBlock holding 10K rows may exceed memory limits. The new logic will adapt the number of rows being read based on observed row size and filter selectivity. Each column reader will then be provided with the number of rows to read and the maximum memory to use. If the memory limit is reached before the requested number of rows has been read, a retry with a dramatically reduced number of rows will occur. The number of rows will increase gradually afterwards (if the data allows).

If we implemented the ReallyLazyBlock there would be no need to load 10K rows at a time. Is there a clear explanation why something like ReallyLazyBlock is not an option?

oerling commented 5 years ago

The point of Aria scan is to do everything right in the space of accessing columnar data sources.

One could liken this to BigQuery, except that we are here a little more sophisticated with filtering inside repeated structures and adaptivity. We can think of this as a compilation of best practices from the column store world.

Filtering at the column level and maintaining a qualifying set of surviving rows is standard. The fact that baseline Presto does not do this is a culture shock to anybody coming from the database world. Reordering simple filters is also common. Running filter expressions as soon as their arguments are available is another self-evident matter.
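As a toy illustration of a qualifying set (plain decoded arrays stand in here for the encoded ORC data the real readers operate on):

```java
import java.util.Arrays;

// Evaluate a filter on one column and keep only the surviving row positions,
// so that subsequent columns are read (and filtered) only at those positions.
public final class QualifyingSetExample
{
    private QualifyingSetExample() {}

    public static int[] filterGreaterThan(long[] column, int[] inputPositions, long constant)
    {
        int[] survivors = new int[inputPositions.length];
        int count = 0;
        for (int position : inputPositions) {
            if (column[position] > constant) {
                survivors[count++] = position;
            }
        }
        return Arrays.copyOf(survivors, count);
    }
}
```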

There is no intrinsic merit to Presto bytecode generation in filters. This is just a loop over rows with ifs, virtual functions and LazyBlock checks inside.

Breaking filter functions into sets that depend on discrete sets of columns and running those early and adaptively is just common sense. This does not particularly affect bytecode gen or its use. It just replaces an arbitrary static order with a dynamic one and makes error behavior consistent and order independent, i.e. not dependent on the order in which the application writing the queries happens to place the filters.
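A sketch of what such a dynamic ordering could look like (the bookkeeping class below is invented for illustration and is not part of Presto):

```java
import java.util.Comparator;
import java.util.List;

// Track how expensive and how selective each filter has been so far, and
// run the cheapest, most selective filters first on the next batch of rows.
public class AdaptiveFilterStats
{
    long nanos;    // total time spent in this filter so far
    long rowsIn;   // rows the filter has been given
    long rowsOut;  // rows that survived the filter

    // Lower score is better: cheap filters that drop many rows come first.
    double costPerDroppedRow()
    {
        if (rowsIn == 0 || rowsIn == rowsOut) {
            return Double.MAX_VALUE;
        }
        return (double) nanos / (rowsIn - rowsOut);
    }

    static void reorder(List<AdaptiveFilterStats> filters)
    {
        filters.sort(Comparator.comparingDouble(AdaptiveFilterStats::costPerDroppedRow));
    }
}
```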

The generated bytecode is not particularly good, but this is not a pressing issue at this point because the major inefficiencies are in the algorithmic structure of scan in general and not in the bytecode specifically.

The connector interface is a minimal interface that corresponds to something like a key value store. The implicit assumption is that the thing one connects to can be out of process.

This is not good for federation and this is also not good for potentially smart in-process data sources like the columnar readers. Degrading is not the word here. Further degrading of functionality would make this entirely unusable. But of course here degrading is used to mean adapting to outside reality. Normally this would be called upgrading.

Indeed, doing what makes sense through the connector interface is one of the top asks out there. And of course people do what they must on an ad hoc basis in any case, e.g. Uber, Alibaba.

Could one do something smart with columnar storage on the engine side, as opposed to the connector side? If the connector interface is something that could conceivably go out of process, the answer is no. If the connector interface produced column readers that had an explicitly in-process interface for seeking to positions and calling a callable for selected values, then the answer is maybe.

The natural interface for federation is query language plus array parameters. The natural interface for scanning columnar data is a function that loops over positions and applies an operation to the values found there. These are at opposite ends of the stack. If a connector did both of these then this might just as well be two different concepts.

One could conceivably consider a single column scan as an operator. The positions for selected values would be foreign keys of a sort, consumed by the operator that scans the next column. A qualifying set would be an intermediate result column. Such schemes have been considered. MonetDB comes close to doing something like this in its intermediate language.

There are however coordination opportunities between columns that are unlike anything you find in other operators. For example, the notion of prefiltering on row group statistics. This would cause the interface to be quite a bit wider than normally between operators. This is why I am not aware of anybody doing this. A column reader is its own kind of entity, not an operator.

Having said this, it is conceivably possible to have a column scan engine that scans and filters different formats. As long as a column exposes a seek and a scan of selected offsets, any format will do. This does not go very far, though, because as soon as we have nested data there are differences between, say, ORC and Parquet that make anything dealing with nested data diverge.
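A format-agnostic reader contract along those lines might look like the following sketch (the interface and method names are assumed, not an existing Presto API):

```java
// Hypothetical column reader contract: any format that can seek to a row and
// produce values for a selected set of offsets could plug into the same scan loop.
public interface SeekableColumnReader
{
    // Position the reader at an absolute row number within the file or stripe.
    void seekToRow(long rowNumber);

    // Read the values at the given offsets (relative to the current position,
    // sorted ascending) into the output array, one value per selected offset.
    void readSelected(int[] selectedOffsets, long[] output);
}
```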

LazyBlock is just a mistake meant to cover up another mistake. The original mistake is not filtering by the column. The follow-up mistake is covering up for this by skipping IO or materialization for columns that are very sparsely accessed. These both originate in the misguided idea that one can have a table-wide boundary between reading and selecting.

Suppose one attached a set of actually needed rows to LazyBlock. The place that triggers loading is the filter. This is a row-wise loop that accesses block after block until some condition is false. This model cannot propagate selectivity from the first filter to the last because in order to do this one would have to run the filters a column at a time. This is precisely what is accomplished in Aria by breaking down the code-generated filter into distinct loops that depend on distinct inputs. And of course we run simple filters that compare columns to constants first, before materializing anything. So the place where a LazyBlock could be qualified by selected rows would be after the filters, where the columns are loaded anyway, so there would really be no point in laziness in any of these situations.

Some people have speculated on returning LazyBlocks from the scan operator. Again, if there were a very selective hash join as the next operator, for example, there could be some point to this. But it is much easier to consider a selective hash join as just a filter expression and run it in the mix of filter expressions. If the hash table is unique, as it is in fact-dimension joins, it is truly a filter. If it is not unique but is still selective, one can still do a prefilter like Bloom or such. This is a time-honored tradition, ever since the invisible join in Daniel Abadi's foundational thesis, or even before then.
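As a sketch of the Bloom-style prefilter idea (a trivial single-hash variant written here for illustration only, not any Presto class):

```java
import java.util.BitSet;

// Minimal Bloom-style prefilter: built from the join keys on the build side and then
// applied as just another cheap scan-time filter on the probe-side column.
public class BloomPrefilter
{
    private final BitSet bits;
    private final int size;

    public BloomPrefilter(long[] buildSideKeys, int size)
    {
        this.size = size;
        this.bits = new BitSet(size);
        for (long key : buildSideKeys) {
            bits.set(index(key));
        }
    }

    // False positives are possible, false negatives are not, so rows can only
    // ever be dropped safely, never incorrectly added.
    public boolean mightContain(long key)
    {
        return bits.get(index(key));
    }

    private int index(long key)
    {
        long hash = key * 0x9E3779B97F4A7C15L;
        return (int) ((hash >>> 33) % size);
    }
}
```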


phd3 commented 5 years ago

@mbasmanova Thanks for working on these optimizations.

I have a question about dereference pushdown in Unnest. Consider a table company_projects with the following schema:

| teamid BIGINT | teamprojects array(row(name varchar, peopleinfo row(manager varchar, engineer varchar))) |
| --- | --- |
| 1 | [row('project_a', row('foo1', 'bar1')), row('project_b', row('foo2', 'bar2')), row('project_c', row('foo3', 'bar3'))] |
| 2 | ... |

Query:

SELECT t.peopleinfo.engineer from company_projects cross join unnest(teamprojects) as t(projectid, peopleinfo)

Plan with Aria Scan: (screenshot)

While executing, will (a) the whole nested teamprojects structure (name, manager and engineer fields) get passed around, OR (b) will just an array of varchars representing the 'engineer' field be read and wrapped as teamprojects?

I don't think (a) is the case, because it would defeat the purpose of the pushdown. Is it? But if (b) is true, why do the output types of TableScan and Unnest NOT reflect that?

mbasmanova commented 5 years ago

@phd3 Pratham, TableScan will produce arrays of {name, {manager, engineer}} structs with name and manager fields set to null and engineer field populated with actual values. Hence, the schema of TableScan output doesn't change, but unreferenced fields are just nulls.

I noticed that there is a bug in the pruning logic. The query you mentioned should produce the teamprojects[*].peopleinfo.engineer pruning path, but the .peopleinfo element is missing. I'll fix that.

phd3 commented 5 years ago

@mbasmanova Thanks for the explanation. Just a heads-up, I hadn't pulled the test fix before producing that result, not sure if it already fixes the bug.

mbasmanova commented 4 years ago

The new scan is largely complete. Use the following session properties to enable it:

set session pushdown_subfields_enabled=true;
set session hive.pushdown_filter_enabled=true;