trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Implement Pushdown to Arbitrary Predicates Over Synthetic Columns at Split Generation Time to Access S3 #4808

Open larandvit opened 4 years ago

larandvit commented 4 years ago

Hello,

The optimization should be applied when data is retrieved from S3 storage, so that the contents of objects are filtered at retrieval time. See the SelectObjectContent documentation, which describes an operation that filters the contents of an S3 object.

The original discussion in Slack: https://prestosql.slack.com/archives/CGB0QHWSW/p1597251503363200.

Sample queries where the optimization is not applied:

SELECT count(*)
FROM minio.abc.data
WHERE "$path" LIKE 's3a://bucket/20200310-04535-1/file-part-0004%';

SELECT count(*)
FROM minio.abc.data
WHERE substr("$path",1,81)='s3a://bucket/20200310-04535-1/file-part-0004';
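
To illustrate what is being asked for, here is a minimal sketch of evaluating a `"$path"` predicate against the listed object paths before any object is opened. It is written in Python for brevity and is not Trino code: `like_to_regex`, `prune_paths`, the listed paths, and the `.orc` extension are all hypothetical, and the LIKE translation only handles `%` and `_` without an ESCAPE clause.

```python
import re
from typing import Callable, Iterable, List


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern (% and _ wildcards, no ESCAPE) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def prune_paths(paths: Iterable[str], predicate: Callable[[str], bool]) -> List[str]:
    """Keep only the paths that satisfy the predicate (hypothetical helper).

    In the requested optimization, this filtering would happen while splits
    are generated, so objects that cannot match are never read.
    """
    return [p for p in paths if predicate(p)]


# Hypothetical listing of objects under the table location.
listed = [
    "s3a://bucket/20200310-04535-1/file-part-0003.orc",
    "s3a://bucket/20200310-04535-1/file-part-0004.orc",
    "s3a://bucket/20200310-04535-1/file-part-0005.orc",
]

like = like_to_regex("s3a://bucket/20200310-04535-1/file-part-0004%")
selected = prune_paths(listed, lambda p: like.fullmatch(p) is not None)
print(selected)
```

With a check like this, only the objects whose path matches would need to be turned into splits, instead of reading every object and filtering the rows afterwards.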

Thanks.

grantatspothero commented 4 years ago

Metadata-only queries against hidden columns would be nice too!

Right now, the hidden columns get tacked onto the rows output from the file, which always requires reading the entire file. It would be nice if there were hidden tables that could be used to query this metadata-only information. This might be a separate issue, though.