trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Presto read ORC error : Malformed ORC file. #6070

Open qfrtrt opened 3 years ago

qfrtrt commented 3 years ago

Presto server version: 344. The query below fails on this version, but it runs successfully on version 0.214.

SQL:

select checked from hive.dw_dwb.dwb_accounting_accounts_day where dt = '2020-06-30' limit 10;

Full stack trace:

io.prestosql.spi.PrestoException: Error opening Hive split hdfs://ns1/user/hive/warehouse/dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000007_1 (offset=0, length=9186920): Malformed ORC file. Cannot read SQL type 'integer' from ORC stream '._col28' of type BYTE with attributes {} [hdfs://ns1/user/hive/warehouse/dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000007_1]
 at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:396)
 at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:162)
 at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:175)
 at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:105)
 at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:66)
 at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:64)
 at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:298)
 at io.prestosql.operator.Driver.processInternal(Driver.java:379)
 at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
 at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
 at io.prestosql.operator.Driver.processFor(Driver.java:276)
 at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
 at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
 at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
 at io.prestosql.$gen.Presto_344____20201123_074321_2.run(Unknown Source)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Cannot read SQL type 'integer' from ORC stream '._col28' of type BYTE with attributes {} [hdfs://ns1/user/hive/warehouse/dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000007_1]
 at io.prestosql.orc.reader.ReaderUtils.invalidStreamType(ReaderUtils.java:45)
 at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:32)
 at io.prestosql.orc.reader.ByteColumnReader.<init>(ByteColumnReader.java:77)
 at io.prestosql.orc.reader.ColumnReaders.createColumnReader(ColumnReaders.java:52)
 at io.prestosql.orc.OrcRecordReader.createColumnReaders(OrcRecordReader.java:563)
 at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:241)
 at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:330)
 at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:341)
 ... 17 more

I noticed a similar error in https://github.com/prestosql/presto/issues/3679. I tried the solution suggested in that issue:

SET SESSION hive.orc_use_column_names=true;

but then a different error occurs:

Query 20201124_095709_00085_9b3b2 failed: ORC file does not contain column names in the footer: hdfs://ns1/user/hive/warehouse/dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000007_1
io.prestosql.spi.PrestoException: ORC file does not contain column names in the footer: hdfs://ns1/user/hive/warehouse/dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000007_1
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.verifyFileHasColumnNames(OrcPageSourceFactory.java:413)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:265)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:162)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:175)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:105)
    at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:66)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:64)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:298)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_344____20201123_074321_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
sopel39 commented 3 years ago

Could you provide more details? I also recommend that you join the prestosql Slack channel.

sjx782392329 commented 3 years ago

We have joined and reported this problem in the Slack #general channel.

findepi commented 3 years ago

> We have joined and reported this problem in the Slack #general channel.

Ref: https://prestosql.slack.com/archives/CFLB9AMBN/p1606212146472500

sjx782392329 commented 3 years ago

Basic information:
The Hive version that Presto 0.214 depends on is hive-apache-1.2.2-2.jar.
The Hive version that Presto 344 depends on is hive-apache-3.1.2-4.jar.

Symptom: I plan to upgrade Presto from 0.214 to 344. When I query Hive data with Presto 344, I get the error above. The same statement runs normally on Presto 0.214.

My attempt: I also queried the same data with the Hive engine. With Hive 1.2.1, the query fails with:

Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable (state=,code=0)

It seems that the conversion from long to int failed. With Hive 3.1.x, the same query runs normally.

Questions

  1. Why can't Presto 344 query my data?
  2. The problematic column is of type int, so I don't understand why Presto uses ByteColumnReader to read the ORC column. I think Presto should use LongColumnReader instead of ByteColumnReader.
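
The stack trace above hints at an answer to the second question: the reader appears to be chosen from the type recorded in the ORC file (BYTE), and only afterwards checked against the SQL type the query requests (integer). The sketch below is a toy Python model of that order of operations, not Presto's code. The names `READER_FOR_ORC_TYPE`, `ACCEPTED_SQL_TYPES`, and `create_column_reader` are hypothetical, and the accepted-type sets are assumptions made for illustration.

```python
# Toy model only: the reader is picked from the ORC file's physical type,
# then the requested SQL type is validated against what that reader accepts.

# Hypothetical mapping from ORC physical type to reader name.
READER_FOR_ORC_TYPE = {
    "BYTE": "ByteColumnReader",
    "INT": "LongColumnReader",
    "LONG": "LongColumnReader",
}

# Assumed sets of SQL types each reader accepts (illustrative, not taken from Presto).
ACCEPTED_SQL_TYPES = {
    "ByteColumnReader": {"tinyint"},
    "LongColumnReader": {"integer", "bigint"},
}


class MalformedOrcFile(Exception):
    """Raised when the requested SQL type cannot be read from the stream."""


def create_column_reader(requested_sql_type: str, orc_type: str, column: str) -> str:
    # Step 1: the file's own type decides which reader is used.
    reader = READER_FOR_ORC_TYPE[orc_type]
    # Step 2: the requested SQL type must be one the chosen reader accepts.
    if requested_sql_type not in ACCEPTED_SQL_TYPES[reader]:
        raise MalformedOrcFile(
            f"Cannot read SQL type '{requested_sql_type}' "
            f"from ORC stream '{column}' of type {orc_type}"
        )
    return reader
```

Under this model, `create_column_reader("integer", "BYTE", "._col3")` raises an error shaped like the one in the report, while the same call with ORC type `INT` selects the long reader.
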
findepi commented 3 years ago

> The hive version that presto 344 depends on is hive-apache-1.2.2-2.jar
> The hive version that presto 344 depends on is hive-apache-3.1.2-4.jar

I guess there is a typo here. Which Presto version did you mean in the first line?

Anyway, this is not relevant for reading ORC files, since Presto AFAICT does not use Hive code on the ORC read path.

Can you please provide the steps to reproduce the problem?

sjx782392329 commented 3 years ago

I reproduced this problem with a test Hive table. Below is my CREATE TABLE statement.

CREATE TABLE `student`(
  `id` int,
  `name` string,
  `sex` string,
  `age` int)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

Then I inserted one row:

INSERT INTO student PARTITION(dt='2020-11-27') VALUES(1,"tom","man",18);

At this point, both Presto 0.214 and Presto 344 can query the table normally. Then I changed the type of age from int to tinyint:

ALTER TABLE student CHANGE COLUMN age age TINYINT;

Before I inserted any new data, neither version 0.214 nor version 344 of Presto could read the row for tom. The error messages are below.

0.214 version

Query 20201127_084851_22720_298c2 failed: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'age' in table 'presto_test.student' is declared as type 'tinyint', but partition 'dt=2020-11-27' declared column 'age' as type 'int'.
com.facebook.presto.spi.PrestoException: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'age' in table 'presto_test.student' is declared as type 'tinyint', but partition 'dt=2020-11-27' declared column 'age' as type 'int'.
    at com.facebook.presto.hive.HiveSplitManager.lambda$getPartitionMetadata$2(HiveSplitManager.java:315)
    at com.google.common.collect.Iterators$6.transform(Iterators.java:788)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at com.google.common.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1340)
    at com.facebook.presto.hive.ConcurrentLazyQueue.poll(ConcurrentLazyQueue.java:37)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:252)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:91)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:185)
    at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
    at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
    at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
    at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

344 version

Query 20201127_082722_00014_sza7s failed: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'checked' in table 'default.test_dwb_accounting_accounts_day' is declared as type 'tinyint', but partition 'dt=2018-06-30/dh=00' declared column 'checked' as type 'int'.
io.prestosql.spi.PrestoException: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'checked' in table 'default.test_dwb_accounting_accounts_day' is declared as type 'tinyint', but partition 'dt=2018-06-30/dh=00' declared column 'checked' as type 'int'.
    at io.prestosql.plugin.hive.HiveSplitManager.tablePartitionColumnMismatchException(HiveSplitManager.java:429)
    at io.prestosql.plugin.hive.HiveSplitManager.getTableToPartitionMapping(HiveSplitManager.java:388)
    at io.prestosql.plugin.hive.HiveSplitManager.lambda$getPartitionMetadata$2(HiveSplitManager.java:343)
    at com.google.common.collect.Iterators$6.transform(Iterators.java:783)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at com.google.common.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1333)
    at io.prestosql.plugin.hive.ConcurrentLazyQueue.poll(ConcurrentLazyQueue.java:37)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:317)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:250)
    at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
    at io.prestosql.$gen.Presto_344____20201127_074300_2.run(Unknown Source)
    at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

I think this kind of error message is expected. Then I inserted a new row:

INSERT INTO student PARTITION(dt='2020-11-27') VALUES(2,"jerry","man",22);

At this point, neither version of Presto can read the data correctly, even when the query filters only on name = jerry.

Then I changed the type of the age column from tinyint back to int. After that, version 0.214 can read the rows for both tom and jerry normally, but version 344 fails to read the data.

The error message of version 344 is as follows:

Query 20201127_090012_00027_sza7s failed: Error opening Hive split hdfs://ns1/user/hive/warehouse/presto_test.db/student/dt=2020-11-27/000000_0_copy_1 (offset=0, length=452): Malformed ORC file. Cannot read SQL type 'integer' from ORC stream '._col3' of type BYTE with attributes {} [hdfs://ns1/user/hive/warehouse/presto_test.db/student/dt=2020-11-27/000000_0_copy_1]
io.prestosql.spi.PrestoException: Error opening Hive split hdfs://ns1/user/hive/warehouse/presto_test.db/student/dt=2020-11-27/000000_0_copy_1 (offset=0, length=452): Malformed ORC file. Cannot read SQL type 'integer' from ORC stream '._col3' of type BYTE with attributes {} [hdfs://ns1/user/hive/warehouse/presto_test.db/student/dt=2020-11-27/000000_0_copy_1]
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:396)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:162)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:175)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:105)
    at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:66)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:64)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:298)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_344____20201127_074300_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Cannot read SQL type 'integer' from ORC stream '._col3' of type BYTE with attributes {} [hdfs://ns1/user/hive/warehouse/presto_test.db/student/dt=2020-11-27/000000_0_copy_1]
    at io.prestosql.orc.reader.ReaderUtils.invalidStreamType(ReaderUtils.java:45)
    at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:32)
    at io.prestosql.orc.reader.ByteColumnReader.<init>(ByteColumnReader.java:77)
    at io.prestosql.orc.reader.ColumnReaders.createColumnReader(ColumnReaders.java:52)
    at io.prestosql.orc.OrcRecordReader.createColumnReaders(OrcRecordReader.java:563)
    at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:241)
    at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:330)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:341)
    ... 17 more

This is the same error message that we originally encountered. In our case, our company also changed the column type and then changed it back.
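
The reproduction sequence can be summarized with a small simulation. This is a hedged Python sketch, assuming that each ORC file keeps the column type the table had when the file was written and that the reader requires an exact match between file type and table type. `OrcFile`, `write_file`, and `unreadable_files` are hypothetical names, and the name of the first file is assumed. Real engines may coerce some types, so this is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrcFile:
    name: str
    age_type: str  # type of `age` recorded in the file when it was written


def write_file(name: str, table_age_type: str) -> OrcFile:
    # Assumption: a file captures the table's column type at write time.
    return OrcFile(name, table_age_type)


def unreadable_files(files, table_age_type):
    # Assumption: strict reader, no coercion between file type and table type.
    return [f.name for f in files if f.age_type != table_age_type]


files = []
table_age_type = "int"
files.append(write_file("000000_0", table_age_type))         # tom (file name assumed)
table_age_type = "tinyint"                                   # ALTER TABLE ... age TINYINT
files.append(write_file("000000_0_copy_1", table_age_type))  # jerry
table_age_type = "int"                                       # type changed back to int

print(unreadable_files(files, table_age_type))
```

In this model the only file that fails the check is `000000_0_copy_1`, which matches the path in the version 344 error.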

I hope prestosql can fix this problem soon so that we can upgrade from version 0.214 to the new version. @findepi Please cc~

sjx782392329 commented 3 years ago

@findepi @dain @martin @electrum Please help me. I have to solve this before I can upgrade Presto from 0.214 to 344.

findepi commented 3 years ago

@sjx782392329 did you try

SET SESSION hive.orc_use_column_names=true;

does it help?

sjx782392329 commented 3 years ago

> @sjx782392329 did you try
>
> SET SESSION hive.orc_use_column_names=true;
>
> does it help?

I tried running this command, but I ran into a new problem. @findepi

Query 20201211_064952_01803_pfqxh failed: ORC file does not contain column names in the footer: hdfs://ns1/user/hive/warehouse/guazi_dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000009_1
io.prestosql.spi.PrestoException: ORC file does not contain column names in the footer: hdfs://ns1/user/hive/warehouse/guazi_dw_dwb.db/dwb_accounting_accounts_day/dt=2018-06-30/dh=00/000009_1
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.verifyFileHasColumnNames(OrcPageSourceFactory.java:413)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:265)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:162)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:175)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:105)
    at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:66)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:64)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:298)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_344____20201209_092958_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
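
One possible reading of this error, given that the earlier message names the stream `._col28`, is that the file stores only placeholder names such as `_col0`, `_col1`, so name-based mapping has nothing to match against. The Python sketch below illustrates such a check. The `_col<N>` pattern and the function name `has_real_column_names` are assumptions, not Presto's implementation.

```python
import re

# Assumed placeholder pattern for files written without real column names.
PLACEHOLDER_NAME = re.compile(r"_col\d+")


def has_real_column_names(column_names):
    """Return False when every column name is a `_col<N>` placeholder."""
    if not column_names:
        return True  # nothing to verify for an empty column list
    return not all(PLACEHOLDER_NAME.fullmatch(name) for name in column_names)
```

With names like `["_col0", "_col1", "_col2", "_col3"]` the check returns `False`. In that situation `hive.orc_use_column_names=true` would have no real names to work with, so columns could only be matched by position.
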
findepi commented 3 years ago

cc @djsstarburst