trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Add the IndexedTpch connector to the standard set of available connectors #1286

Open am314159 opened 5 years ago

am314159 commented 5 years ago

Presto comes with the IndexedTpch connector, in package io.prestosql.tests.tpch. This connector should provide a useful example of the way that Presto understands indexes, and sample data to try them out on. A user could use it to experiment with Presto indexes when they want to understand its capabilities.

The IndexedTpch connector is not listed in the user documentation as one of the standard connectors and (I believe) it cannot be added to the catalog simply by creating a properties file with `connector.name=tpchindexed`. This makes it hard to try out.
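For reference, this is the kind of one-line catalog file I tried. The connector name `tpchindexed` below is my guess at what the indexed variant might be called if it were registered; as far as I can tell, no name for it works in a standard server install:

```properties
# etc/catalog/tpchindexed.properties
# Hypothetical catalog file -- this connector name is not currently
# recognized by a standard Presto server installation.
connector.name=tpchindexed
```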

Can we make this connector one of the standard set of connectors and provide some user documentation?

findepi commented 5 years ago

It's not documented and not possible to configure because IndexedTpch exists for test purposes only. (In fact, it's implemented in the tests module.) From a user's perspective, it should behave exactly the same way as the tpch connector -- you should get the same query results. I don't yet see the value of providing the indexed tpch connector as part of the presto-server installation. Please help me understand.

electrum commented 5 years ago

I can see this being useful to see what different query plans look like. It’s not useful to normal end users, but could be useful for developers. Having it built in would allow for easier testing via Docker, without needing to launch it in the IDE.

am314159 commented 4 years ago

Thanks, @findepi and @electrum. Apologies for taking so long to reply.

The fundamental reason why I want to be able to use the IndexedTpch connector is to understand how and when (and if) Presto will make use of indexes in its underlying data stores, and when instead it will fall back on its default behaviour of a full-table scan and a hash join.

If you take a look at issue #1285 then you can see some of the questions that I would like to be able to answer:

With this in mind, @findepi, what you say is correct: it should not matter to a user whether the queries that you run are executed using an index-lookup, a nested-loop or a hash-join. The only difference that you, as a user, should notice is a difference in the time that it takes to execute the query.

However, @electrum has grasped the crucial difference. There appears to be the capability within the Presto source code to support index-based queries, and so (presumably) nested-loop joins, but very few of the connectors seem to support index use. The IndexedTpch connector is one of the very few which does, but it is not easy to get it up and running.

In a perfect world, Presto would use (for example) an ElasticSearch index when it is available and it would join two ElasticSearch tables together using a nested-loop-style join. This kind of join is only efficient to do when the query is sufficiently precise and returns only a few rows from each table in the join.

In the right circumstances, such as "retrieve all the details and orders involving customer X", a nested-loop is far and away the fastest query plan, so long as the underlying tables are correctly indexed.
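To make the trade-off concrete, here is a toy sketch in Python (nothing to do with Presto's actual implementation; the table and column names are made up) of the two strategies for "retrieve all orders for customer X": a full-scan hash join that touches every row, versus an index lookup that jumps straight to the matching rows.

```python
# Toy "orders" table, as a list of rows.
orders = [
    {"orderkey": 1, "custkey": "A", "total": 100},
    {"orderkey": 2, "custkey": "B", "total": 250},
    {"orderkey": 3, "custkey": "A", "total": 75},
    {"orderkey": 4, "custkey": "C", "total": 40},
]

def scan_then_hash_probe(table, custkey):
    """Full-scan strategy: read every row to build a hash table,
    then probe it. Cost is proportional to the whole table."""
    buckets = {}
    for row in table:                      # touches all rows, always
        buckets.setdefault(row["custkey"], []).append(row)
    return buckets.get(custkey, [])

def build_index(table):
    """Pretend the connector maintains this index ahead of time."""
    index = {}
    for row in table:
        index.setdefault(row["custkey"], []).append(row)
    return index

def index_lookup(index, custkey):
    """Index-nested-loop strategy: jump straight to matching rows.
    Cost is proportional to the result size, not the table size."""
    return index.get(custkey, [])

orders_by_custkey = build_index(orders)

# Both plans return the same rows -- only the work done differs.
assert scan_then_hash_probe(orders, "A") == index_lookup(orders_by_custkey, "A")
```

The point is exactly the one above: for a highly selective predicate, the index lookup does work proportional to the few rows returned, while the scan-and-hash plan always pays for the whole table.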

I don't believe that the ElasticSearch connector has index-use implemented yet - do please tell me if I am mistaken - but the IndexedTpch connector may do. This is why it would be good to have it as a standard easy-to-run connector.

At the moment, this index-usage capability is hidden deep in the connector API specification, and you have to go looking to even know it is there. Even now, I am not sure whether it actually works in practice. It is really easy to assume that Presto is a bulk-scanning engine (much like, say, Apache Spark). If the index lookups do work, Presto could have the ability to perform very fast index-based joins when handling very precise queries that return just a few rows -- and that is something that Spark cannot do.

I guess what this issue (and issue #1285) is all about is exposing and explaining this capability to more sophisticated users -- good database administrators have always needed to understand how their database works in order to get good performance, after all.

Does that make sense? Please carry on discussing it and I'll try to give a more timely response next time.