trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.18k stars 2.94k forks source link

Date partition projection limited to a date_projection_interval_unit of no greater than days in trino-hive plugin #22760

Open MialLewis opened 1 month ago

MialLewis commented 1 month ago

Date partition projection is limited to a date_projection_interval_unit of no greater than DAYS in the trino-hive plugin.

I have a scenario where a external data is partitioned by a date key formatted yyyy-MM, with an interval unit of MONTHS and an interval value of 1.

I appreciate that this restriction is due to issues with replicating the exact implementation in Athena: https://github.com/trinodb/trino/blame/c3ee39a64df1923df41ac2c95e8ea3e10b45fe4b/plugin/trino-hive/src/main/java/io/trino/plugin/hive/projection/DateProjection.java#L66

Ideally this restriction would be lifted.

findinpath commented 1 month ago

@zhaner08 / @pettyjamesm can you please share the "secret sauce" in this regard on how AWS Athena is doing the increments on months?

pettyjamesm commented 1 month ago

I took a look at the code, and some notable differences that might impact matching the behavior of our implementation are:

Otherwise, the logic for computing the values is basically the same, with no special handling for ChronoUnit.MONTHS. Values are generated with a loop like:

    // relevant fields
    private final LocalDateTime minDateTime;
    private final LocalDateTime maxDateTime;
    private final ChronoUnit intervalUnit;
    private final long intervalSize;
    private final DateTimeFormatter dateFormatter;

    // relevant logic for iterating values
    public Iterator<String> getProjectedValues()
    {
        return new Iterator<String>()
        {
            private LocalDateTime next = minDateTime;

            @Override
            public boolean hasNext()
            {
                return !next.isAfter(maxDateTime);
            }

            @Override
            public String next()
            {
                String value = next.format(dateFormatter);
                next = next.plus(intervalSize, intervalUnit);
                return value;
            }
        };
    }
daveoy commented 2 weeks ago

would love to see a resolution to this. its especially confusing when creating a table using partitioning because you get conflicting messages from the coordinator regarding syntax:

Query 20240826_193927_00017_p3s5b failed: line 13:154: Unable to set catalog 'hive' column property 'partition_projection_interval_unit' to ['YEAR']: Invalid value [YEAR]. Valid values: [NANOS, MICROS, MILLIS, SECONDS, MINUTES, HOURS, HALF_DAYS, DAYS, WEEKS, MONTHS, YEARS, DECADES, CENTURIES, MILLENNIA, ERAS, FOREVER]
create table hive.table.name (
            index varchar,
            name varchar,
            vendor varchar,
            site varchar,
            start_datetime varchar,
            end_datetime varchar,
            duration varchar,
            duration_hr varchar,
            week varchar,
            month varchar,
            region varchar,
            pyear date WITH (partition_projection_type = 'DATE', partition_projection_format = 'yyyy', partition_projection_range = ARRAY['2024','NOW'], partition_projection_interval_unit = 'YEAR', partition_projection_interval = 1),
            pmonth date WITH (partition_projection_type = 'DATE', partition_projection_format = 'MM', partition_projection_range = ARRAY['01','12'], partition_projection_interval_unit = 'MONTH', partition_projection_interval = 1),
            pday date WITH (partition_projection_type = 'DATE', partition_projection_format = 'dd', partition_projection_range = ARRAY['01','31'], partition_projection_interval_unit = 'DAY', partition_projection_interval = 1),
            phour date WITH (partition_projection_type = 'DATE', partition_projection_format = 'HH', partition_projection_range = ARRAY['00','23'], partition_projection_interval_unit = 'HOUR', partition_projection_interval = 1),
            pminute date WITH (partition_projection_type = 'DATE', partition_projection_format = 'mm', partition_projection_range = ARRAY['00','59'], partition_projection_interval_unit = 'MINUTE', partition_projection_interval = 1),
            psecond date WITH (partition_projection_type = 'DATE', partition_projection_format = 'ss', partition_projection_range = ARRAY['00','59'], partition_projection_interval_unit = 'SECOND', partition_projection_interval = 1)
        )
        WITH (
            external_location = 's3a://bucket-name/some/path/',
            format = 'CSV',
            skip_header_line_count = 1,
            partitioned_by = ARRAY['pyear', 'pmonth', 'pday', 'phour', 'pminute', 'psecond'],
            partition_projection_enabled = true,
            partition_projection_location_template = 's3a://bucket-name/some/path/${pyear}/${pmonth}/${pday}/${phour}/${pminute}/${psecond}/'
        )

but on correcting those partition_projection_interval_unit to be plural, as required by the error msg above, the coordinator then complains about the restricted available units:

Query 20240826_201021_00054_p3s5b failed: Column projection for column 'pyear' failed. Property: 'partition_projection_interval_unit' value 'Years' is invalid. Available options: [Days, Hours, Minutes, Seconds]

this leaves me with INJECTED types for year and month as my only options for those partitions,

            pyear varchar WITH (partition_projection_type = 'INJECTED'),
            pmonth varchar WITH (partition_projection_type = 'INJECTED'),

subscribing for updates