trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Support complex condition on synthetic $path column #2422

Open bennysenone opened 4 years ago

bennysenone commented 4 years ago

We want to achieve an effect similar to bucketing in a Hive table by predicate pushdown on the synthetic $path column. Right now this only works for an equality clause or an IN clause, but it does not work when a complex condition is used, such as "$path" LIKE '%xyz%'. Per the discussion on the Slack thread https://prestosql.slack.com/archives/CFLB9AMBN/p1578312918226100, it seems it might be possible to intervene between the directory listing of a Hive partition and split generation.
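For illustration, the kind of predicates involved (the table name below is made up):

```sql
-- Works today: equality / IN on the synthetic $path column prunes files
-- before split generation.
SELECT count(*)
FROM hive.ingest.events
WHERE "$path" IN (
    's3://bucket/events/part-00001.orc',
    's3://bucket/events/part-00002.orc'
);

-- Does not work today: a complex condition such as LIKE is only applied
-- after the scan, so every file in the partition is still listed and read.
SELECT count(*)
FROM hive.ingest.events
WHERE "$path" LIKE '%xyz%';
```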

ddrinka commented 4 years ago

A great first step for me would be to allow prefix restriction, since it maps directly to S3 listing operations. If we could detect "$path" LIKE 's3://bucket/prefix/subkey_%' and map it to an S3 ListObjects call with the prefix restriction set, it would meet the vast majority of my pseudo-bucketing requirements, as we carefully name S3 keys for exactly such queries.

For instance, we have objects named like this: s3://bucket/objectType/year=2019/month=11/day=15/hour=02/minute=47/source=81/data.orc. While partitioning seems reasonable for data organized like this, the data is used for ingest and contains hundreds of thousands of objects per day, which are sorted and partitioned into production tables over time.

In the current implementation, if days are left behind in these ingest buckets, the S3 ListObjects operation begins taking tens of minutes, when only a hundred or so keys actually match the query.
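A sketch of the kind of query this would help with (table name and key layout are illustrative): if the engine recognized the prefix-shaped LIKE pattern, it could issue a prefix-restricted ListObjects call instead of listing the entire table location.

```sql
-- Hypothetical ingest table whose location spans hundreds of thousands of keys.
-- A prefix-shaped LIKE on "$path" could be translated into an S3 ListObjects
-- request with Prefix=objectType/year=2019/month=11/day=15/hour=02/,
-- rather than a full listing of the table location.
SELECT *
FROM hive.ingest.raw_events
WHERE "$path" LIKE 's3://bucket/objectType/year=2019/month=11/day=15/hour=02/%';
```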

bennysenone commented 4 years ago

We have a somewhat different use case because all our queries always pass the partition keys. However, the physical location of the partitions does not follow the Hive convention: we relocate partitions to other buckets with random prefixes to avoid object names sharing a common prefix, which would cause them to sit in the same internal S3 partition. We are also looking at Apache Iceberg to overcome the partition scaling limit of the Hive metastore.

ddrinka commented 4 years ago

@bennysenone I'm also looking forward to Iceberg...

FYI, the best practice of randomizing key prefixes to optimize S3 hash partitioning is no longer relevant. They fixed this in 2018: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

bennysenone commented 4 years ago

We had a lengthy discussion with AWS solution architects very recently because we kept getting hit by the S3 API rate limit. The conclusion was that to be able to leverage S3's automatic key-range partitioning, you need to introduce randomness as early as possible in the key name (the key being the full path). Our ingestion pipeline performs online compaction of 3-minute ORC files (around 20 MB each) into larger ones (up to 500 MB).

ddrinka commented 4 years ago

@bennysenone That's surprising and concerning to hear, as the S3 documentation appeared very clear on this matter.

There's additional support for what you're describing here: https://stackoverflow.com/a/56155997

It sounds like what AWS changed isn't the partitioning scheme, but just the rate limits per partition, so that most applications won't hit the limits anymore. I find their public documentation of this to be really misleading, and it's helpful to have this new perspective.

That said, there's some conjecture in that same SO answer about whether the entire prefix is now hashed (everything before the last /) or just the first few characters, as before. It sounds like you have that settled with your support team, and that randomness really does need to be introduced early in the key name?

bennysenone commented 4 years ago

According to AWS support, S3 dynamically decides whether to split a key range based on access patterns. Each key range is indeed bound by the 5,500 API calls/sec limit. Note that S3 can also merge key ranges back together. So if the keys share a common prefix (unfortunately we couldn't get a conclusive answer on how long that prefix is), S3 internals won't split it. Therefore, by smearing the prefix (we prepended a 4-character random string), the access gets spread across all ranges in the internal partition, thereby triggering the key-range split.
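A minimal sketch of the key smearing described above, assuming the random prefix is derived per object at write time (the expression below is just one way to get 4 random hex characters in Trino SQL; the key layout is illustrative):

```sql
-- Derive a 4-character random prefix and place it at the very start of the key,
-- so S3 can spread requests across key ranges and split them under load.
SELECT
    lower(substr(to_hex(md5(to_utf8(cast(uuid() AS varchar)))), 1, 4))
        || '/objectType/year=2019/month=11/day=15/data.orc' AS smeared_key;
```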