trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.21k stars 2.94k forks source link

Error occur "Can not read SQL type timestamp from ORC stream .long_type of type LONG" since prestosql-308 #1107

Closed archongum closed 5 years ago

archongum commented 5 years ago

(Before all)
This error occurs since prestosql-308. According to my tests:


Re-produce steps

0. Presto version

I use 307 and 308.
Do the following steps on 307 or 308. (I did these on 307 and I think it doesn't matter on which version)

1. Create table a_307

The columns explain themself.

CREATE TABLE hive."default".a_307 (
   int_type int,
   long_type bigint,
   date_type date,
   timestamp_type timestamp
)
WITH (
   -- change location
   external_location = 'hdfs://datacenter1:8020/tmp/307/',
   format = 'ORC'
);

2. Insert a record

insert into hive."default".a_307 values 
(
    17897, -- 2019-01-01
    1546300800000, -- 2019-01-01 00:00:00 +0000
    date('2019-01-01'), 
    cast(parse_datetime('2019-01-01 00:00:00 +0000', 'yyyy-MM-dd HH:mm:ss Z') as timestamp)
);

3. Create other table b_307 and change the types of column int_type and long_type

CREATE TABLE hive."default".b_307 (
   int_type date, -- type changed
   long_type timestamp, -- type changed
   date_type date,
   timestamp_type timestamp
)
WITH (
   -- the same location as table a_307
   external_location = 'hdfs://datacenter1:8020/tmp/307/',
   format = 'ORC'
);

4. Select table

Now, we can run select queries.

-- works on both 307- and 308+
select * from hive."default".a_307;

-- only works on 307-, raises error on 308+
select * from hive."default".b_307;

4.1 Prestosql-307

Both sqls work as expected.

(My timezone is +0800)

  1. The result of selecting table a_307 sql:
int_type long_type date_type timestamp_type
17897 1546300800000 2019-01-01 2019-01-01 08:00:00
  1. The result of selecting table b_307 sql:
int_type long_type date_type timestamp_type
2019-01-01 2019-01-01 08:00:00 2019-01-01 2019-01-01 08:00:00

4.2 Prestosql-308

Only the first sql works but the second sql raise error.

  1. The result of selecting table a_307 sql:

    (The same as above)

  2. The ERROR of selecting table b_307 sql:

io.prestosql.spi.PrestoException: Error opening Hive split hdfs://datacenter1:8020/tmp/307/20190711_080152_00034_dqexr_c744e714-1a9a-437f-bc9a-70721fffc937 (offset=0, length=493): Malformed ORC file. Can not read SQL type timestamp from ORC stream .long_type of type LONG [hdfs://datacenter1:8020/tmp/307/20190711_080152_00034_dqexr_c744e714-1a9a-437f-bc9a-70721fffc937]
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:234)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:120)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:162)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:96)
    at io.prestosql.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:53)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:247)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_308____20190711_082302_1.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Can not read SQL type timestamp from ORC stream .long_type of type LONG [hdfs://datacenter1:8020/tmp/307/20190711_080152_00034_dqexr_c744e714-1a9a-437f-bc9a-70721fffc937]
    at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:40)
    at io.prestosql.orc.reader.LongStreamReader.<init>(LongStreamReader.java:89)
    at io.prestosql.orc.reader.StreamReaders.createStreamReader(StreamReaders.java:39)
    at io.prestosql.orc.OrcRecordReader.createStreamReaders(OrcRecordReader.java:559)
    at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:240)
    at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:249)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:204)
    ... 17 more
archongum commented 5 years ago

Update


Works between int and date types on prestosql-308+

select int_type, date_type from hive."default".b_307;

Result:

int_type date_type
2019-01-01 2019-01-01
archongum commented 5 years ago

Compare 307 and 308: https://github.com/prestosql/presto/compare/2ae6f76...25637d2

Are these commits relevant?

Snipaste_2019-07-12_10-32-02

dain commented 5 years ago

The new code is very strict about type the coercions that were implicitly happening in the old versions. There were many cases where the code perform a wrong/illegal conversion, and cases where expected coercions did not work. The new code restricts the type coercions to and explicit list that we believe is what ORC in Hive supports.

In this exact setup, the data in the file is a BIGINT (64bit signed integer) and the partition is saying that the table is a timestamp which is not a coercions we support anywhere in the Hive Presto code. Is this table readable in Hive?

archongum commented 5 years ago

@dain It's odd: This re-produce table b_307 can't be read by Hive. But the table from my dataware house can be read by Hive!

update: This re-produce table b_307 can be read by Hive too!
Error: java.io.IOException: java.lang.RuntimeException: ORC split generation failed with exception: Malformed ORC file. Invalid postscript length 17 (state=,code=0) this error is not relevant here and will occur if just only one file in that table directory, see: https://github.com/prestosql/presto/issues/456

Table fact_sales_order_item_cancel is the one which column sale_time can be read from long/bigint to timestamp in prestosql-307 and Hive but can not be read on prestosql-308+.

On Hive: Snipaste_2019-07-12_14-56-34

(this error is not relevant here) Snipaste_2019-07-12_15-05-45

On Prestosql-308:

io.prestosql.spi.PrestoException: Error opening Hive split hdfs://datacenter1:8020/dw_prod/dw/fact_sales_order_item_cancel/fact_sales_order_item_cancel.orc (offset=0, length=32796150): Malformed ORC file. Can not read SQL type timestamp from ORC stream .sale_time of type LONG [hdfs://datacenter1:8020/dw_prod/dw/fact_sales_order_item_cancel/fact_sales_order_item_cancel.orc]
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:235)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:121)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:164)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:98)
    at io.prestosql.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:53)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:277)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_316____20190712_064008_1.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Can not read SQL type timestamp from ORC stream .sale_time of type LONG [hdfs://datacenter1:8020/dw_prod/dw/fact_sales_order_item_cancel/fact_sales_order_item_cancel.orc]
    at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:40)
    at io.prestosql.orc.reader.LongStreamReader.<init>(LongStreamReader.java:89)
    at io.prestosql.orc.reader.StreamReaders.createStreamReader(StreamReaders.java:39)
    at io.prestosql.orc.OrcRecordReader.createStreamReaders(OrcRecordReader.java:559)
    at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:240)
    at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:249)
    at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:205)
    ... 17 more

Table b_307' ORC meta:

Processing data file type_307.orc [length: 493]
Structure for type_307.orc
File Version: 0.12 with PRESTO_ORIGINAL
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<int_type:int,long_type:bigint,date_type:date,timestamp_type:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 17897 max: 17897 sum: 17897
    Column 2: count: 1 hasNull: true min: 1546300800000 max: 1546300800000 sum: 1546300800000
    Column 3: count: 1 hasNull: true min: 2019-01-01 max: 2019-01-01
    Column 4: count: 1 hasNull: true

File Statistics:
  Column 0: count: 1 hasNull: true
  Column 1: count: 1 hasNull: true min: 17897 max: 17897 sum: 17897
  Column 2: count: 1 hasNull: true min: 1546300800000 max: 1546300800000 sum: 1546300800000
  Column 3: count: 1 hasNull: true min: 2019-01-01 max: 2019-01-01
  Column 4: count: 1 hasNull: true

Stripes:
  Stripe: offset: 3 data: 40 rows: 1 tail: 76 index: 96
    Stream: column 1 section ROW_INDEX start: 3 length 27
    Stream: column 2 section ROW_INDEX start: 30 length 30
    Stream: column 3 section ROW_INDEX start: 60 length 24
    Stream: column 4 section ROW_INDEX start: 84 length 15
    Stream: column 4 section SECONDARY start: 99 length 6
    Stream: column 1 section DATA start: 105 length 7
    Stream: column 3 section DATA start: 112 length 7
    Stream: column 4 section DATA start: 119 length 9
    Stream: column 2 section DATA start: 128 length 11
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2

File length: 493 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  presto.writer.version=308
  presto_query_id=20190711_080152_00034_dqexr
  presto_version=308

Table from my dataware house ORC meta (column: sale_time:bigint):

Processing data file fact_sales_order_item_cancel.orc [length: 32796150]
Structure for fact_sales_order_item_cancel.orc
File Version: 0.12 with HIVE_8732
Rows: 1577447
Compression: ZLIB
Compression size: 10240
Type: struct<sale_date_key:int,sale_time_key:string,store_key:int,cancel_date_key:int,cancel_time_key:string,goods_key:int,agent_key:int,employee_key:int,sales_order_type_key:int,sales_order_id:bigint,sales_order_item_id:bigint,sku_cnt:int,goods_original_amt:double,goods_amt:double,goods_actual_amt:double,goods_cost_amt:double,goods_activity_amt:double,goods_delivery_amt:double,sale_time:bigint,cancel_time:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1577447 hasNull: false
    Column 1: count: 1577447 hasNull: false min: 20160801 max: 20190711 sum: 31833094273046
    Column 2: count: 1577447 hasNull: false min: 0000 max: 2359 sum: 6309788
    Column 3: count: 1577447 hasNull: false min: 0 max: 256846 sum: 144526518912
    Column 4: count: 1577447 hasNull: false min: 20170409 max: 99991231 sum: 39481570890349
    Column 5: count: 1577447 hasNull: false min: 0000 max: 9999 sum: 6309788
    Column 6: count: 1577447 hasNull: false min: 1 max: 23663 sum: 16990010101
    Column 7: count: 1577447 hasNull: false min: 4 max: 30 sum: 27499502
    Column 8: count: 1577447 hasNull: false min: 0 max: 1123 sum: 786972351
    Column 9: count: 1577447 hasNull: false min: 1 max: 6 sum: 1622851
    Column 10: count: 1577447 hasNull: false min: 49 max: 9820524 sum: 12181981400360
    Column 11: count: 1577447 hasNull: false min: 21439619 max: 45661002 sum: 57335420317461
    Column 12: count: 1577447 hasNull: false min: 1 max: 2000 sum: 5050194
    Column 13: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.4682518291266719E8
    Column 14: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.431705717933843E8
    Column 15: count: 500224 hasNull: true min: 0.0 max: 76028.0 sum: 4.458023168999998E7
    Column 16: count: 1577447 hasNull: false min: 0.0 max: 0.0 sum: 0.0
    Column 17: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.124265655991611E7
    Column 18: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.0757970079629976E7
    Column 19: count: 1577447 hasNull: false min: 1469981244000 max: 1562860572000 sum: 2414000351563797000
    Column 20: count: 1481627 hasNull: true min: 1491722849000 max: 1562860629000 sum: 2270431488626707000

File Statistics:
  Column 0: count: 1577447 hasNull: false
  Column 1: count: 1577447 hasNull: false min: 20160801 max: 20190711 sum: 31833094273046
  Column 2: count: 1577447 hasNull: false min: 0000 max: 2359 sum: 6309788
  Column 3: count: 1577447 hasNull: false min: 0 max: 256846 sum: 144526518912
  Column 4: count: 1577447 hasNull: false min: 20170409 max: 99991231 sum: 39481570890349
  Column 5: count: 1577447 hasNull: false min: 0000 max: 9999 sum: 6309788
  Column 6: count: 1577447 hasNull: false min: 1 max: 23663 sum: 16990010101
  Column 7: count: 1577447 hasNull: false min: 4 max: 30 sum: 27499502
  Column 8: count: 1577447 hasNull: false min: 0 max: 1123 sum: 786972351
  Column 9: count: 1577447 hasNull: false min: 1 max: 6 sum: 1622851
  Column 10: count: 1577447 hasNull: false min: 49 max: 9820524 sum: 12181981400360
  Column 11: count: 1577447 hasNull: false min: 21439619 max: 45661002 sum: 57335420317461
  Column 12: count: 1577447 hasNull: false min: 1 max: 2000 sum: 5050194
  Column 13: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.4682518291266719E8
  Column 14: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.431705717933843E8
  Column 15: count: 500224 hasNull: true min: 0.0 max: 76028.0 sum: 4.458023168999998E7
  Column 16: count: 1577447 hasNull: false min: 0.0 max: 0.0 sum: 0.0
  Column 17: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.124265655991611E7
  Column 18: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.0757970079629976E7
  Column 19: count: 1577447 hasNull: false min: 1469981244000 max: 1562860572000 sum: 2414000351563797000
  Column 20: count: 1481627 hasNull: true min: 1491722849000 max: 1562860629000 sum: 2270431488626707000

Stripes:
  Stripe: offset: 3 data: 32739224 rows: 1577447 tail: 357 index: 55534
    Stream: column 0 section ROW_INDEX start: 3 length 34
    Stream: column 1 section ROW_INDEX start: 37 length 2475
    Stream: column 2 section ROW_INDEX start: 2512 length 1368
    Stream: column 3 section ROW_INDEX start: 3880 length 2601
    Stream: column 4 section ROW_INDEX start: 6481 length 2517
    Stream: column 5 section ROW_INDEX start: 8998 length 1487
    Stream: column 6 section ROW_INDEX start: 10485 length 2618
    Stream: column 7 section ROW_INDEX start: 13103 length 1559
    Stream: column 8 section ROW_INDEX start: 14662 length 2117
    Stream: column 9 section ROW_INDEX start: 16779 length 1431
    Stream: column 10 section ROW_INDEX start: 18210 length 3281
    Stream: column 11 section ROW_INDEX start: 21491 length 3603
    Stream: column 12 section ROW_INDEX start: 25094 length 2012
    Stream: column 13 section ROW_INDEX start: 27106 length 2787
    Stream: column 14 section ROW_INDEX start: 29893 length 2834
    Stream: column 15 section ROW_INDEX start: 32727 length 4259
    Stream: column 16 section ROW_INDEX start: 36986 length 842
    Stream: column 17 section ROW_INDEX start: 37828 length 4198
    Stream: column 18 section ROW_INDEX start: 42026 length 4514
    Stream: column 19 section ROW_INDEX start: 46540 length 4064
    Stream: column 20 section ROW_INDEX start: 50604 length 4933
    Stream: column 1 section DATA start: 55537 length 1148573
    Stream: column 2 section DATA start: 1204110 length 1099347
    Stream: column 2 section LENGTH start: 2303457 length 18
    Stream: column 2 section DICTIONARY_DATA start: 2303475 length 2441
    Stream: column 3 section DATA start: 2305916 length 1507374
    Stream: column 4 section DATA start: 3813290 length 1072379
    Stream: column 5 section DATA start: 4885669 length 1009603
    Stream: column 5 section LENGTH start: 5895272 length 18
    Stream: column 5 section DICTIONARY_DATA start: 5895290 length 2438
    Stream: column 6 section DATA start: 5897728 length 2628328
    Stream: column 7 section DATA start: 8526056 length 418152
    Stream: column 8 section DATA start: 8944208 length 935403
    Stream: column 9 section DATA start: 9879611 length 9230
    Stream: column 10 section DATA start: 9888841 length 1827951
    Stream: column 11 section DATA start: 11716792 length 3422466
    Stream: column 12 section DATA start: 15139258 length 816371
    Stream: column 13 section DATA start: 15955629 length 3741313
    Stream: column 14 section DATA start: 19696942 length 4031227
    Stream: column 15 section PRESENT start: 23728169 length 63913
    Stream: column 15 section DATA start: 23792082 length 1617878
    Stream: column 16 section DATA start: 25409960 length 80118
    Stream: column 17 section PRESENT start: 25490078 length 53640
    Stream: column 17 section DATA start: 25543718 length 850892
    Stream: column 18 section PRESENT start: 26394610 length 53640
    Stream: column 18 section DATA start: 26448250 length 1053144
    Stream: column 19 section DATA start: 27501394 length 2740567
    Stream: column 20 section PRESENT start: 30241961 length 9204
    Stream: column 20 section DATA start: 30251165 length 2543596
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DICTIONARY_V2[1439]
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2
    Encoding column 5: DICTIONARY_V2[1440]
    Encoding column 6: DIRECT_V2
    Encoding column 7: DIRECT_V2
    Encoding column 8: DIRECT_V2
    Encoding column 9: DIRECT_V2
    Encoding column 10: DIRECT_V2
    Encoding column 11: DIRECT_V2
    Encoding column 12: DIRECT_V2
    Encoding column 13: DIRECT
    Encoding column 14: DIRECT
    Encoding column 15: DIRECT
    Encoding column 16: DIRECT
    Encoding column 17: DIRECT
    Encoding column 18: DIRECT
    Encoding column 19: DIRECT_V2
    Encoding column 20: DIRECT_V2

File length: 32796150 bytes
Padding length: 0 bytes
Padding ratio: 0%

I'm going to look for more details!

archongum commented 5 years ago

@dain From ORC type to Table type

  1. prestosql-307 or earlier
operation doable
from int to date
from long/bigint to timestamp
  1. prestosql-308 or later
operation doable
from int to date
from long/bigint to timestamp
  1. hive
operation doable
from int to date
from long/bigint to timestamp
archongum commented 5 years ago

The new code is very strict about type the coercions that were implicitly happening in the old versions. There were many cases where the code perform a wrong/illegal conversion, and cases where expected coercions did not work. The new code restricts the type coercions to and explicit list that we believe is what ORC in Hive supports.