trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Support type evolution from double in Parquet to varchar in table/partition #6581

Open iammehrabalam opened 3 years ago

iammehrabalam commented 3 years ago

Earlier we stored the score as a string, but now we store it as a double. As a result, the type of the score column is inconsistent across partitions.

To read the score, we created a Presto table:

CREATE TABLE demo_table (
    eid varchar,
    score varchar,
    pkey varchar
) WITH (
    external_location = 'hdfs://path/to/folder',
    format = 'PARQUET',
    partitioned_by = ARRAY['pkey']
)

Basically, we want to read the score as varchar instead of double because of this inconsistency. The query works in Hive but not in Presto.

hive> select * from demo_table limit 10;
OK
C1  0.9863184140487596
C2  0.966310728943439
C3  0.982657427820512
C4  0.9885486818622775
C5  0.9867805453687933
C6  0.9914695210540662
C7  0.9847466290034234
C8  0.9460807112923405
C9  0.9898082352413242
C0  0.9779168018767527

But executing the same query in Presto throws an exception:

select * from demo_table limit 10;

java.lang.UnsupportedOperationException: io.prestosql.spi.type.VarcharType
    at io.prestosql.spi.type.AbstractType.writeDouble(AbstractType.java:103)
    at io.prestosql.parquet.reader.DoubleColumnReader.readValue(DoubleColumnReader.java:32)
    at io.prestosql.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:183)
    at io.prestosql.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:203)
    at io.prestosql.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:182)
    at io.prestosql.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:170)
    at io.prestosql.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:262)
    at io.prestosql.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:314)
    at io.prestosql.parquet.reader.ParquetReader.readBlock(ParquetReader.java:297)
    at io.prestosql.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:164)
    at io.prestosql.spi.block.LazyBlock$LazyData.load(LazyBlock.java:381)
    at io.prestosql.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:360)
    at io.prestosql.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:276)
    at io.prestosql.spi.Page.getLoadedPage(Page.java:279)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:304)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_345____20210112_103211_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

hashhar commented 3 years ago

Can you please mention the Hive version and distribution you are using? Different Hive versions allow different coercions.

This has the same root cause as https://github.com/trinodb/trino/issues/2817 (missing type coercions between numeric and varchar types).

Numeric-to-varchar type coercion is also not covered by the existing cases in TestHiveCoercion.
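
To illustrate what the missing coercion amounts to, here is a minimal, self-contained sketch. It does not use the Presto/Trino SPI: `SketchType`, `SketchDoubleType`, `SketchVarcharType`, and both `read...` methods are hypothetical stand-ins. The first path mimics the stack trace above, where a double read from the file is written straight into a varchar-typed column and fails. The second path converts the value to text first. Using `Double.toString` for the text form is an assumption; the rendering Hive actually uses may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class DoubleToVarcharSketch {
    // Hypothetical stand-in for a column type. Writes are unsupported unless overridden,
    // similar in spirit to the exception raised from AbstractType.writeDouble in the trace.
    public interface SketchType {
        default void writeDouble(List<Object> out, double value) {
            throw new UnsupportedOperationException(getClass().getName());
        }

        default void writeString(List<Object> out, String value) {
            throw new UnsupportedOperationException(getClass().getName());
        }
    }

    public static final class SketchDoubleType implements SketchType {
        @Override
        public void writeDouble(List<Object> out, double value) {
            out.add(value);
        }
    }

    public static final class SketchVarcharType implements SketchType {
        @Override
        public void writeString(List<Object> out, String value) {
            out.add(value);
        }
    }

    // No coercion: the file holds doubles, so the reader always calls writeDouble
    // on whatever type the table declares. A varchar column rejects this.
    public static List<Object> readWithoutCoercion(double[] fileValues, SketchType tableType) {
        List<Object> out = new ArrayList<>();
        for (double value : fileValues) {
            tableType.writeDouble(out, value);
        }
        return out;
    }

    // With coercion: when the table declares varchar, render each double as text first.
    // Double.toString is an assumed rendering, not necessarily what Hive produces.
    public static List<Object> readWithCoercion(double[] fileValues, SketchType tableType) {
        List<Object> out = new ArrayList<>();
        for (double value : fileValues) {
            if (tableType instanceof SketchVarcharType) {
                tableType.writeString(out, Double.toString(value));
            } else {
                tableType.writeDouble(out, value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fileValues = {0.5, 0.25};
        SketchType declared = new SketchVarcharType();

        try {
            readWithoutCoercion(fileValues, declared);
        } catch (UnsupportedOperationException e) {
            System.out.println("without coercion: " + e);
        }

        System.out.println("with coercion: " + readWithCoercion(fileValues, declared));
    }
}
```

A real fix would live in the Parquet reader or the Hive connector's coercion layer and would need matching test cases; the sketch only shows the shape of the conversion.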