trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.23k stars 2.95k forks source link

Use Hive Metastore API for listing partitions that allows range (and other) predicates on the partition keys #611

Open Parth-Brahmbhatt opened 5 years ago

Parth-Brahmbhatt commented 5 years ago

See thread https://prestosql.slack.com/archives/CFLB9AMBN/p1554842107205400.

Currently in HivePartitionManager we use getPartitionNamesByParts API from metastore which only allows a single value per partition predicate. This means any non equality partition predicate can not be pushed down to metastore which results in 2 problems:

In the newer version of Hive new APIs are introduced to avoid this issue. We should look into moving to get_partitions_by_expr or get_partitions_by_filter for types that are supported by these APIs.

electrum commented 5 years ago

This is a good idea. Do you know what Hive version these were added in? Hopefully they’re old enough to be available everywhere, otherwise we’ll need some fallback code (catch remote “no such method” error and call the old one).

Parth-Brahmbhatt commented 5 years ago

If I just go by when the API was added, then that was 9 years ago.

luohao commented 5 years ago

We have explored this a bit, but I think we should probably avoid using get_partitions_by_expr(PartitionsByExprRequest req). It's very very bad API design. It includes a Kryo serialized object in the request:

struct PartitionsByExprRequest {
...
 3: required binary expr,
...
} 

There are a few potential issues:

We should probably try get_partitions_by_filter. It uses a filter string, which is more standard.

electrum commented 5 years ago

The problem with get_partitions_by_expr, get_partitions_by_filter, and get_part_specs_by_filter is that they return the full partition metadata objects, which can be huge and take too long to fetch.

With the existing get_partition_names_ps call, we see issues with timeouts and Thrift response size limits on tables with many partitions. And that's just fetching the names, which is a single string, not the huge metadata object.

Presto fetches partition names while planning, then fetches the full partition metadata iteratively during split generation as the query executes. The engine provides the connector with a Constraint containing a black box Predicate that is used to further filter the name list, which allows expressions or functions over partition columns which cannot be expressed via TupleDomain.

I now remember looking into this in the past and coming to the above conclusion that there was no better API that does what we need (list partition names with a filter). However, they recently added get_partition_values in HIVE-17466 which seems to do exactly what we need. The PartitionValuesRequest struct has a filter field.

We should check if this API is available in CDH 5 and HDP 2.

Parth-Brahmbhatt commented 5 years ago

CDH 5 seems to be on Hive 1.1.0 + patches and the Jira you have linked does not seem to be in the list of patches

HDP-2.6.5 which is the most latest release of HDP-2 also only ships HIVE-2.1.0

No matter when we decide to move to a newer API I believe we will have to make it config enabled to support organization that are running on older version of hive servers so I do not see a reason not to implement this right now and make it config enabled.

electrum commented 5 years ago

I agree, there's no reason not to implement it now. We can do it without config if we make the fallback transparent. Have a boolean flag defaulting to true to indicate it is supported, then set it to false if we get a "no such method" error from the remote Thrift server. This means the first request after a server restart will take slightly longer, but this shouldn't cause any problems.

luohao commented 5 years ago

Agreed with @electrum.

As long as we have a fallback mechanism it should not break the compatibility. Hive adopts a similar strategy where a MetaException from HMS will make it falls back to the implementation of client-side filtering.

    430         try {
    431           hasUnknownPartitions = Hive.get().getPartitionsByExpr(
    432               tab, compactExpr, conf, partitions);
    433         } catch (IMetaStoreClient.IncompatibleMetastoreException ime) {
    434           // TODO: backward compat for Hive <= 0.12. Can be removed later.
    435           LOG.warn("Metastore doesn't support getPartitionsByExpr", ime);
    436           doEvalClientSide = true;
    437         } finally {
    438           perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PARTITION_RETRIEVING);
    439         }
findepi commented 5 years ago

get_partition_values was added in Hive 2.4, but this version is not downloadable (https://archive.apache.org/dist/hive/) and, AFAIK, we're not ready to require Hive 3.1 just yet. I created enhancement proposal to backport get_partition_values to Hive 2.3 branch (https://issues.apache.org/jira/browse/HIVE-21859).

findepi commented 5 years ago

get_partition_values has been merged to Hive 2.3 branch (https://issues.apache.org/jira/browse/HIVE-21859).

dannylinden commented 4 years ago

Here also @tooptoop4

rash67 commented 4 years ago

I'm taking a look at this right now thanks

rash67 commented 4 years ago

After a discussion with @electrum we decided to break this into pieces

  1. Refactor HiveMetastore interface to only have a getPartitionNamesByFilter() method, including the callers and implementations.
  2. Update GlueHiveMetastore to take advantage of the new API.
  3. Update ThriftHiveMetastore to conditionally use the old or new Thrift call. (edited)

I'm tidying up a PR for #1 and will attach today for feedback (consider WIP, still testing). It can be reviewed and committed separately, or bundled with the next steps. However, if the Glue implementation is the desired feature, I would recommend implementing and testing that without #3 first.

rash67 commented 4 years ago

I'm debugging some CI test failures with this commit, but if anyone wants start providing feedback on the refactor, that would be helpful. Once I've sorted out any test issues, I'll begin work on the HiveGlueMetastore implementation that translates TupleDomain -> a filter string. I'll do the BridgingHiveMetastore/ThriftMetastore translation

the first commit after the refactor will push any Domain -> wildcard translation into each HMS impl, likely.

rash67 commented 4 years ago

I'm pushing the conversion to List into implementations now and will begin work on the Glue impl + test cases after that.

rash67 commented 4 years ago

pushed conversion of TupleDomain -> List into each HiveMetastore impl

next GlueHiveMetastore implementation + tests

rash67 commented 4 years ago

see comments in the PR for progress. The PR/diff is around 1k, but includes some copy & pasted files (serDe related code)

rash67 commented 4 years ago

I'm working on testing out the GlueHiveMetastore implementation now. I'm adding a large number of test cases for the supported types and a variety of queries (equals, in clause, ranges, etc)

tooptoop4 commented 3 years ago

has this improvement been implemented for non-glue hive ? @rash67