paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
GNU Affero General Public License v3.0
148 stars 10 forks source link

Support custom partitioning schemes based on "patterns" in the OPTIONS files parameter of CREATE FOREIGN TABLE #79

Open pdpark opened 1 month ago

pdpark commented 1 month ago

What feature are you requesting?

The ability to specify a custom partitioning scheme through the use of a pattern in the files option when creating foreign tables, like this:

CREATE FOREIGN TABLE my_table ()
SERVER parquet_server
OPTIONS (files 's3://bucket/data_{id_1}_{id_2}.parquet')

Why are you requesting this feature?

To support existing custom partitioning scheme.

What is your proposed implementation for this feature?

Foreign tables could be created like this:

CREATE FOREIGN TABLE my_table ()
SERVER parquet_server
OPTIONS (files 's3://bucket/data_{id_1}_{id_2}.parquet')

...or this:

CREATE FOREIGN TABLE my_table ()
SERVER parquet_server
OPTIONS (files 's3://bucket/data_{id_1}_*.parquet')

The values in brackets must correspond with column names defined in the referenced parquet files or the statement will fail.

When running a query like this on the first table defined above:

select *
from my_table
where id_1 = '1234'
and id_2 = '0987'

...the id_1 and id_2 column values from the sql where clause will be substituted into the files pattern producing a string that must correspond with an actual parquet file at the specified s3 location:

s3://bucket/data_1234_0987.parquet

A query on the second table table defined above:

select *
from my_table
where id_1 = '1234'

...will produce a files pattern after substitution that looks like this:

s3://bucket/data_1234_*.parquet

Full Name:

Patrick Park

Affiliation:

Payzer

Weijun-H commented 1 week ago

~It seems that we could introduce hive_partitioning setting to fix this ticket.~

philippemnoel commented 1 week ago

~It seems that we could introduce hive_partitioning setting to fix this ticket.~

@shamb0 has made a PR to document hive partitioned, we just need to review and merge it. As for custom partitioning scheme that is not Hive, ~I'm not convinced we want to expose that as it is probably an edge case. Unless you have an idea that we haven't considered~

EDIT: We're still open to considering this, but are waiting for more user requests

Weijun-H commented 1 week ago

~It seems that we could introduce hive_partitioning setting to fix this ticket.~

@shamb0 has made a PR to document hive partitioned, we just need to review and merge it. As for custom partitioning scheme that is not Hive, I'm not convinced we want to expose that as it is probably an edge case. Unless you have an idea that we haven't considered

I see, it makes sense to me. Btw we should add an example in https://docs.paradedb.com/ingest/import/parquet#parquet-options for the hive partitioned

philippemnoel commented 1 week ago

~It seems that we could introduce hive_partitioning setting to fix this ticket.~

@shamb0 has made a PR to document hive partitioned, we just need to review and merge it. As for custom partitioning scheme that is not Hive, I'm not convinced we want to expose that as it is probably an edge case. Unless you have an idea that we haven't considered

I see, it makes sense to me. Btw we should add an example in https://docs.paradedb.com/ingest/import/parquet#parquet-options for the hive partitioned

Agreed