paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
GNU Affero General Public License v3.0
129 stars 9 forks source link

Support multi-level partition tables #56

Open pdpark opened 1 month ago

pdpark commented 1 month ago

What feature are you requesting?

Support for foreign tables in a multi-level partitioned table setup (see example).

Why are you requesting this feature?

When you have a lot of large files in one object store key/directory, scanning all of the files can be prohibitively time consuming. Changing the storage structure to support hive keys may not always be an option.

What is your proposed implementation for this feature?

This is an example of a multi-level partition table, but in many cases only the first partition level would be needed.

Example:

A “root” table that will have two partition “levels”:

CREATE TABLE my_table(
  <column definitions>
) PARTITION BY list(id_1);

The table for the first partition level:

create table if not exists my_table_123
partition of my_table
for values in ('123') partition by list(id_2);

Many partition tables may be created at this level - one for each id_1 value in the root table.

Note: If these first two tables have to be created as foreign tables in order for the last partition level table to be a foreign table, they shouldn’t require a files option.

The table for the next partition level looks like this:

create foreign table if not exists my_table_123_abc
partition of my_table_123
for values in ('abc')
options (files 's3://<bucket_name>/.../<filename>.parquet');

Many partition tables may be created at this level as well - one for each id_2 value in the first partition level tables.

Note: when creating this final “leaf” partition table as a foreign table, a files option should be required.

So, when querying the root table with a where clause like this:

select *
from my_table
where id_1 = '123'
and   id_2 = 'abc'

...only one parquet file should be scanned - the one specified by the files option of the “leaf” partition table. Or, if a glob pattern is used in the files option of the leaf partition table, only the (ideally) few number of matching files will be scanned.

Full Name:

Patrick Park

Affiliation:

Payzer

shamb0 commented 1 month ago

Hi @philippemnoel , I am interested to work on this feature, could you please assign it to me

philippemnoel commented 1 month ago

Hi @philippemnoel , I am interested to work on this feature, could you please assign it to me

Absolutely, it is yours! Thank you for your work :)