trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0

SQL cannot run using Presto, but can run using Hive, when the type of a column in the ORC file doesn't match the type defined in the Hive table #3679

Open Xuchaojing opened 4 years ago

Xuchaojing commented 4 years ago

Query 20200509_032316_00682_9pmv8 failed: Error opening Hive split hdfs://xxxxxxx (offset=33554432, length=33554432): Malformed ORC file. Can not read SQL type bigint from ORC stream .id of type STRING

Is there any suggestion about this issue?

ebyhr commented 4 years ago

Could you try hive.orc.use-column-names=true?
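
For reference, that is a catalog-level property; a minimal sketch of what etc/catalog/hive.properties might look like, assuming the hive-hadoop2 connector (the metastore URI is a placeholder):

  connector.name=hive-hadoop2
  hive.metastore.uri=thrift://metastore-host:9083
  # resolve ORC columns by name instead of by ordinal position
  hive.orc.use-column-names=true

The coordinator and workers need to be restarted for catalog property changes to take effect.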

Also, I would suggest you join the community Slack https://prestosql.io/slack.html

Xuchaojing commented 4 years ago

hive.orc.use-column-names doesn't solve this issue. When I add this property to catalog/hive.properties, restart the Presto cluster, and then run the SQL, the error still occurs.

findepi commented 4 years ago

(For hive.orc.use-column-names, please track https://github.com/prestosql/presto/issues/1558)

@Xuchaojing can you provide the full error stacktrace with and without hive.orc.use-column-names=true, as well as the table schema (SHOW CREATE TABLE xxx)?

Xuchaojing commented 4 years ago

In order to protect the security of the information, I changed the real name of the table to "abc". The following is the table schema:

CREATE TABLE `ods.abc`(
  `id` bigint,
  `identifier` string,
  `identifier_md5` string,
  `channel` bigint,
  `os_version` bigint,
  `request_url` string)
COMMENT 'abc'
PARTITIONED BY (
  `pt` string COMMENT 'full_partition')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'field.delim'='\u0001',
  'line.delim'='\n',
  'serialization.format'='\u0001')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://testHadoop/dataware/ods/abc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1574322027')

The full error stacktrace when running the SQL using Presto:

Query 20200509_161449_04577_9pmv8 failed: Error opening Hive split hdfs://testHadoop/dataware/ods/abc/pt=20200425/part-0f22af46_648b_45dd_896b_6cd088339140 (offset=134217728, length=33554432): Malformed ORC file. Can not read SQL type bigint from ORC stream .id of type STRING [hdfs://testHadoop/dataware/ods/abc/pt=20200425/part-0f22af46_648b_45dd_896b_6cd088339140]
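
One way to confirm where the mismatch lies is to inspect the ORC metadata of the failing file directly; a sketch assuming the Hive CLI is available on a host with HDFS access (the path is the one from the error message):

  # print the file metadata, including the schema the writer used
  hive --orcfiledump hdfs://testHadoop/dataware/ods/abc/pt=20200425/part-0f22af46_648b_45dd_896b_6cd088339140

If the dumped type is something like struct<id:string,...> while the table declares id as bigint, the files were written with a schema that differs from the current table definition, which would match the error above.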

ebyhr commented 4 years ago

@Xuchaojing You can get the full stacktrace with the --debug CLI option or the Web UI. (The pasted message isn't the full stacktrace.)
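
As an illustration, a sketch of starting the CLI with debugging enabled, assuming the CLI executable is named presto and the coordinator listens on localhost:8080:

  # with --debug, a failed query prints the full Java stacktrace instead of only the error message
  presto --server localhost:8080 --catalog hive --schema ods --debug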

Xuchaojing commented 4 years ago

This is the full stacktrace. The version of PrestoSQL I used is 317.

Query 20200510_052523_00189_9pmv8 failed: Error opening Hive split hdfs://testHadoop/dataware/ods/abc/pt=20200426/part-__170d9cc5_7b28_4828_bc34_ad141740ac89 (offset=134217728, length=33554432): Malformed ORC file. Can not read SQL type bigint from ORC stream .id of type STRING [hdfs://testHadoop/dataware/ods/abc/pt=20200426/part-__170d9cc5_7b28_4828_bc34_ad141740ac89]
io.prestosql.spi.PrestoException: Error opening Hive split hdfs://testHadoop/dataware/ods/abc/pt=20200426/part-__170d9cc5_7b28_4828_bc34_ad141740ac89 (offset=134217728, length=33554432): Malformed ORC file. Can not read SQL type bigint from ORC stream .id of type STRING [hdfs://testHadoop/dataware/ods/abc/pt=20200426/part-__170d9cc5_7b28_4828_bc34_ad141740ac89]
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:235)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:121)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:164)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:98)
    at io.prestosql.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:45)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:283)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_317_dirty____20200428_034006_1.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Can not read SQL type bigint from ORC stream .id of type STRING [hdfs://testHadoop/dataware/ods/abc/pt=20200426/part-__170d9cc5_7b28_4828_bc34_ad141740ac89]
    at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:40)
    at io.prestosql.orc.reader.SliceStreamReader.<init>(SliceStreamReader.java:63)
    at io.prestosql.orc.reader.StreamReaders.createStreamReader(StreamReaders.java:48)
    at io.prestosql.orc.OrcRecordReader.createStreamReaders(OrcRecordReader.java:559)
    at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:240)
    at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:249)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:205)

ebyhr commented 4 years ago

@Xuchaojing Could you share two kinds of full stacktrace with:

  1. hive.orc.use-column-names=true
  2. hive.orc.use-column-names=false

Xuchaojing commented 4 years ago

The error stack is the same, whether hive.orc.use-column-names is true or false.

ebyhr commented 4 years ago

Let's continue in this Slack thread https://prestosql.slack.com/archives/CGB0QHWSW/p1589011535253100

ArvinZheng commented 4 years ago

I think this is because the ORC reader that Hive uses converts the data type from File Type to Reader Type when possible, but Presto does not support the conversion currently.
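
If that is the cause, one possible workaround until such a conversion is supported would be to rewrite the affected partitions through Hive, so that the new ORC files carry the types the table declares; a sketch in Hive SQL, assuming overwriting the partition in place is acceptable:

  -- Hive reads the stored string values, converts them to the declared bigint columns,
  -- and writes new ORC files whose column types match the table definition.
  INSERT OVERWRITE TABLE ods.abc PARTITION (pt='20200425')
  SELECT id, identifier, identifier_md5, channel, os_version, request_url
  FROM ods.abc
  WHERE pt = '20200425';

Alternatively, the table column type could be changed to match what is actually stored in the files.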