trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Presto Cannot read SQL type decimal(38,24) from ORC stream column name of type INT #3373

Open · arghya18 opened this issue 4 years ago

arghya18 commented 4 years ago

I am trying to read an external table from Presto. The data type of one column was changed from INT to decimal, so the older ORC files store the column as INT while the newer ones store it as decimal. The data type of the column in the current external table definition is decimal.

Unfortunately, Presto is not able to read that column and throws the error below, even though the data type was only widened and Presto would be expected to read the old files whose column type is INT, just as Hive is able to.

io.prestosql.spi.PrestoException: Error opening Hive split s3://<bucket>/2004070444/part-00000-a1862d3a-79da-4ed7-9aab-4efe62846750-c000.zlib.orc (offset=0, length=725091): Malformed ORC file. Cannot read SQL type decimal(38,24) from ORC stream .date_key of type INT [s3://<bucket>/2004070444/part-00000-a1862d3a-79da-4ed7-9aab-4efe62846750-c000.zlib.orc]
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:314)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:147)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:190)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:108)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:91)
    at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:61)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:296)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_331____20200407_145643_2.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Cannot read SQL type decimal(38,24) from ORC stream .date_key of type INT [s3://<bucket>/2004070444/part-00000-a1862d3a-79da-4ed7-9aab-4efe62846750-c000.zlib.orc]
    at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:40)
    at io.prestosql.orc.reader.LongColumnReader.<init>(LongColumnReader.java:88)
    at io.prestosql.orc.reader.ColumnReaders.createColumnReader(ColumnReaders.java:38)
    at io.prestosql.orc.OrcRecordReader.createColumnReaders(OrcRecordReader.java:557)
    at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:236)
    at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:275)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:273)
    ... 18 more
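For reference, here is a minimal sketch of how a table can end up in this state. The database, table, and column names other than date_key are hypothetical, and the ALTER may require metastore settings that permit the type change; the real table is an external Hive table over ORC files on S3.

    -- Hive: the column was originally INT, so older ORC files store it as INT
    CREATE EXTERNAL TABLE example_db.fact_table (
        date_key INT,
        amount   DOUBLE
    )
    STORED AS ORC
    LOCATION 's3://<bucket>/fact_table/';

    -- later the column type is widened in the metastore;
    -- files written after this point store the column as decimal
    ALTER TABLE example_db.fact_table
        CHANGE COLUMN date_key date_key DECIMAL(38,24);

    -- Hive reads both old (INT) and new (DECIMAL) files and coerces the old values,
    -- while Presto fails on the old files with the OrcCorruptionException shown above
    SELECT date_key FROM example_db.fact_table LIMIT 10;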

sopel39 commented 4 years ago

@arghya18 Are you sure Hive reads correct decimal values for old files?

findepi commented 4 years ago

@arghya18 What if you SELECT date_key .. only? Can you also run SET SESSION your-catalog-name.orc_use_column_names = true; before the query?
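Spelled out with a hypothetical catalog named hive and the hypothetical table from the sketch above, the suggested check would look like this:

    -- match ORC columns by name instead of by position for this session
    SET SESSION hive.orc_use_column_names = true;

    -- re-run the query restricted to the problematic column
    SELECT date_key FROM hive.example_db.fact_table LIMIT 10;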

arghya18 commented 4 years ago

@arghya18 Are you sure Hive reads correct decimal values for old files?

@sopel39: Yes, Hive is able to read the old files and returns correct values.

arghya18 commented 4 years ago

@arghya18 What if you SELECT date_key .. only? Can you also run SET SESSION your-catalog-name.orc_use_column_names = true; before the query?

@findepi: If we select only date_key, we get the same error, and if we exclude date_key, everything works. I have also tried running the query after setting your-catalog-name.orc_use_column_names = true; but got the same issue.

arghya18 commented 4 years ago

Any further update on this, please?

findepi commented 4 years ago

The fix would be to add support for decimal in the LongColumnReader class, plus proper tests.
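Until such reader-level coercion exists, a possible interim workaround (a sketch only, reusing the hypothetical names from above and assuming the old data can be rewritten from Hive) is to rewrite the old ORC files so that every file physically stores the column as decimal:

    -- Hive: copy the data so the resulting ORC files store date_key as DECIMAL(38,24);
    -- the copy (or the rewritten files) can then replace the old data
    CREATE TABLE example_db.fact_table_rewritten STORED AS ORC AS
    SELECT CAST(date_key AS DECIMAL(38,24)) AS date_key, amount
    FROM example_db.fact_table;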

zcliang97 commented 3 years ago

Hello, I am running into this issue as well and just wanted to check whether it is being worked on. Thanks

Subhashini2610 commented 1 year ago

Same issue here. Does a fix exist?