Closed archongum closed 5 years ago
Update
Works between int
and date
types on prestosql-308+
select int_type, date_type from hive."default".b_307;
Result:
int_type | date_type |
---|---|
2019-01-01 | 2019-01-01 |
Compare 307 and 308: https://github.com/prestosql/presto/compare/2ae6f76...25637d2
Are these commits relevant?
The new code is very strict about type the coercions that were implicitly happening in the old versions. There were many cases where the code perform a wrong/illegal conversion, and cases where expected coercions did not work. The new code restricts the type coercions to and explicit list that we believe is what ORC in Hive supports.
In this exact setup, the data in the file is a BIGINT (64bit signed integer) and the partition is saying that the table is a timestamp which is not a coercions we support anywhere in the Hive Presto code. Is this table readable in Hive?
@dain It's odd: This re-produce table b_307
can't be read by Hive. But the table from my dataware house can be read by Hive!
update: This re-produce table
b_307
can be read by Hive too!
Error: java.io.IOException: java.lang.RuntimeException: ORC split generation failed with exception: Malformed ORC file. Invalid postscript length 17 (state=,code=0)
this error is not relevant here and will occur if just only one file in that table directory, see: https://github.com/prestosql/presto/issues/456
Table fact_sales_order_item_cancel
is the one which column sale_time
can be read from long/bigint
to timestamp
in prestosql-307 and Hive but can not be read on prestosql-308+.
On Hive:
(this error is not relevant here)
On Prestosql-308:
io.prestosql.spi.PrestoException: Error opening Hive split hdfs://datacenter1:8020/dw_prod/dw/fact_sales_order_item_cancel/fact_sales_order_item_cancel.orc (offset=0, length=32796150): Malformed ORC file. Can not read SQL type timestamp from ORC stream .sale_time of type LONG [hdfs://datacenter1:8020/dw_prod/dw/fact_sales_order_item_cancel/fact_sales_order_item_cancel.orc]
at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:235)
at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:121)
at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:164)
at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:98)
at io.prestosql.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:53)
at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:277)
at io.prestosql.operator.Driver.processInternal(Driver.java:379)
at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
at io.prestosql.operator.Driver.processFor(Driver.java:276)
at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
at io.prestosql.$gen.Presto_316____20190712_064008_1.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.orc.OrcCorruptionException: Malformed ORC file. Can not read SQL type timestamp from ORC stream .sale_time of type LONG [hdfs://datacenter1:8020/dw_prod/dw/fact_sales_order_item_cancel/fact_sales_order_item_cancel.orc]
at io.prestosql.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:40)
at io.prestosql.orc.reader.LongStreamReader.<init>(LongStreamReader.java:89)
at io.prestosql.orc.reader.StreamReaders.createStreamReader(StreamReaders.java:39)
at io.prestosql.orc.OrcRecordReader.createStreamReaders(OrcRecordReader.java:559)
at io.prestosql.orc.OrcRecordReader.<init>(OrcRecordReader.java:240)
at io.prestosql.orc.OrcReader.createRecordReader(OrcReader.java:249)
at io.prestosql.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:205)
... 17 more
Table b_307
' ORC meta:
Processing data file type_307.orc [length: 493]
Structure for type_307.orc
File Version: 0.12 with PRESTO_ORIGINAL
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<int_type:int,long_type:bigint,date_type:date,timestamp_type:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 17897 max: 17897 sum: 17897
Column 2: count: 1 hasNull: true min: 1546300800000 max: 1546300800000 sum: 1546300800000
Column 3: count: 1 hasNull: true min: 2019-01-01 max: 2019-01-01
Column 4: count: 1 hasNull: true
File Statistics:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 17897 max: 17897 sum: 17897
Column 2: count: 1 hasNull: true min: 1546300800000 max: 1546300800000 sum: 1546300800000
Column 3: count: 1 hasNull: true min: 2019-01-01 max: 2019-01-01
Column 4: count: 1 hasNull: true
Stripes:
Stripe: offset: 3 data: 40 rows: 1 tail: 76 index: 96
Stream: column 1 section ROW_INDEX start: 3 length 27
Stream: column 2 section ROW_INDEX start: 30 length 30
Stream: column 3 section ROW_INDEX start: 60 length 24
Stream: column 4 section ROW_INDEX start: 84 length 15
Stream: column 4 section SECONDARY start: 99 length 6
Stream: column 1 section DATA start: 105 length 7
Stream: column 3 section DATA start: 112 length 7
Stream: column 4 section DATA start: 119 length 9
Stream: column 2 section DATA start: 128 length 11
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT_V2
Encoding column 4: DIRECT_V2
File length: 493 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
presto.writer.version=308
presto_query_id=20190711_080152_00034_dqexr
presto_version=308
Table from my dataware house ORC meta (column: sale_time:bigint
):
Processing data file fact_sales_order_item_cancel.orc [length: 32796150]
Structure for fact_sales_order_item_cancel.orc
File Version: 0.12 with HIVE_8732
Rows: 1577447
Compression: ZLIB
Compression size: 10240
Type: struct<sale_date_key:int,sale_time_key:string,store_key:int,cancel_date_key:int,cancel_time_key:string,goods_key:int,agent_key:int,employee_key:int,sales_order_type_key:int,sales_order_id:bigint,sales_order_item_id:bigint,sku_cnt:int,goods_original_amt:double,goods_amt:double,goods_actual_amt:double,goods_cost_amt:double,goods_activity_amt:double,goods_delivery_amt:double,sale_time:bigint,cancel_time:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1577447 hasNull: false
Column 1: count: 1577447 hasNull: false min: 20160801 max: 20190711 sum: 31833094273046
Column 2: count: 1577447 hasNull: false min: 0000 max: 2359 sum: 6309788
Column 3: count: 1577447 hasNull: false min: 0 max: 256846 sum: 144526518912
Column 4: count: 1577447 hasNull: false min: 20170409 max: 99991231 sum: 39481570890349
Column 5: count: 1577447 hasNull: false min: 0000 max: 9999 sum: 6309788
Column 6: count: 1577447 hasNull: false min: 1 max: 23663 sum: 16990010101
Column 7: count: 1577447 hasNull: false min: 4 max: 30 sum: 27499502
Column 8: count: 1577447 hasNull: false min: 0 max: 1123 sum: 786972351
Column 9: count: 1577447 hasNull: false min: 1 max: 6 sum: 1622851
Column 10: count: 1577447 hasNull: false min: 49 max: 9820524 sum: 12181981400360
Column 11: count: 1577447 hasNull: false min: 21439619 max: 45661002 sum: 57335420317461
Column 12: count: 1577447 hasNull: false min: 1 max: 2000 sum: 5050194
Column 13: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.4682518291266719E8
Column 14: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.431705717933843E8
Column 15: count: 500224 hasNull: true min: 0.0 max: 76028.0 sum: 4.458023168999998E7
Column 16: count: 1577447 hasNull: false min: 0.0 max: 0.0 sum: 0.0
Column 17: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.124265655991611E7
Column 18: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.0757970079629976E7
Column 19: count: 1577447 hasNull: false min: 1469981244000 max: 1562860572000 sum: 2414000351563797000
Column 20: count: 1481627 hasNull: true min: 1491722849000 max: 1562860629000 sum: 2270431488626707000
File Statistics:
Column 0: count: 1577447 hasNull: false
Column 1: count: 1577447 hasNull: false min: 20160801 max: 20190711 sum: 31833094273046
Column 2: count: 1577447 hasNull: false min: 0000 max: 2359 sum: 6309788
Column 3: count: 1577447 hasNull: false min: 0 max: 256846 sum: 144526518912
Column 4: count: 1577447 hasNull: false min: 20170409 max: 99991231 sum: 39481570890349
Column 5: count: 1577447 hasNull: false min: 0000 max: 9999 sum: 6309788
Column 6: count: 1577447 hasNull: false min: 1 max: 23663 sum: 16990010101
Column 7: count: 1577447 hasNull: false min: 4 max: 30 sum: 27499502
Column 8: count: 1577447 hasNull: false min: 0 max: 1123 sum: 786972351
Column 9: count: 1577447 hasNull: false min: 1 max: 6 sum: 1622851
Column 10: count: 1577447 hasNull: false min: 49 max: 9820524 sum: 12181981400360
Column 11: count: 1577447 hasNull: false min: 21439619 max: 45661002 sum: 57335420317461
Column 12: count: 1577447 hasNull: false min: 1 max: 2000 sum: 5050194
Column 13: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.4682518291266719E8
Column 14: count: 1577447 hasNull: false min: 0.0 max: 84000.0 sum: 1.431705717933843E8
Column 15: count: 500224 hasNull: true min: 0.0 max: 76028.0 sum: 4.458023168999998E7
Column 16: count: 1577447 hasNull: false min: 0.0 max: 0.0 sum: 0.0
Column 17: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.124265655991611E7
Column 18: count: 312633 hasNull: true min: 0.0 max: 76028.0 sum: 3.0757970079629976E7
Column 19: count: 1577447 hasNull: false min: 1469981244000 max: 1562860572000 sum: 2414000351563797000
Column 20: count: 1481627 hasNull: true min: 1491722849000 max: 1562860629000 sum: 2270431488626707000
Stripes:
Stripe: offset: 3 data: 32739224 rows: 1577447 tail: 357 index: 55534
Stream: column 0 section ROW_INDEX start: 3 length 34
Stream: column 1 section ROW_INDEX start: 37 length 2475
Stream: column 2 section ROW_INDEX start: 2512 length 1368
Stream: column 3 section ROW_INDEX start: 3880 length 2601
Stream: column 4 section ROW_INDEX start: 6481 length 2517
Stream: column 5 section ROW_INDEX start: 8998 length 1487
Stream: column 6 section ROW_INDEX start: 10485 length 2618
Stream: column 7 section ROW_INDEX start: 13103 length 1559
Stream: column 8 section ROW_INDEX start: 14662 length 2117
Stream: column 9 section ROW_INDEX start: 16779 length 1431
Stream: column 10 section ROW_INDEX start: 18210 length 3281
Stream: column 11 section ROW_INDEX start: 21491 length 3603
Stream: column 12 section ROW_INDEX start: 25094 length 2012
Stream: column 13 section ROW_INDEX start: 27106 length 2787
Stream: column 14 section ROW_INDEX start: 29893 length 2834
Stream: column 15 section ROW_INDEX start: 32727 length 4259
Stream: column 16 section ROW_INDEX start: 36986 length 842
Stream: column 17 section ROW_INDEX start: 37828 length 4198
Stream: column 18 section ROW_INDEX start: 42026 length 4514
Stream: column 19 section ROW_INDEX start: 46540 length 4064
Stream: column 20 section ROW_INDEX start: 50604 length 4933
Stream: column 1 section DATA start: 55537 length 1148573
Stream: column 2 section DATA start: 1204110 length 1099347
Stream: column 2 section LENGTH start: 2303457 length 18
Stream: column 2 section DICTIONARY_DATA start: 2303475 length 2441
Stream: column 3 section DATA start: 2305916 length 1507374
Stream: column 4 section DATA start: 3813290 length 1072379
Stream: column 5 section DATA start: 4885669 length 1009603
Stream: column 5 section LENGTH start: 5895272 length 18
Stream: column 5 section DICTIONARY_DATA start: 5895290 length 2438
Stream: column 6 section DATA start: 5897728 length 2628328
Stream: column 7 section DATA start: 8526056 length 418152
Stream: column 8 section DATA start: 8944208 length 935403
Stream: column 9 section DATA start: 9879611 length 9230
Stream: column 10 section DATA start: 9888841 length 1827951
Stream: column 11 section DATA start: 11716792 length 3422466
Stream: column 12 section DATA start: 15139258 length 816371
Stream: column 13 section DATA start: 15955629 length 3741313
Stream: column 14 section DATA start: 19696942 length 4031227
Stream: column 15 section PRESENT start: 23728169 length 63913
Stream: column 15 section DATA start: 23792082 length 1617878
Stream: column 16 section DATA start: 25409960 length 80118
Stream: column 17 section PRESENT start: 25490078 length 53640
Stream: column 17 section DATA start: 25543718 length 850892
Stream: column 18 section PRESENT start: 26394610 length 53640
Stream: column 18 section DATA start: 26448250 length 1053144
Stream: column 19 section DATA start: 27501394 length 2740567
Stream: column 20 section PRESENT start: 30241961 length 9204
Stream: column 20 section DATA start: 30251165 length 2543596
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DICTIONARY_V2[1439]
Encoding column 3: DIRECT_V2
Encoding column 4: DIRECT_V2
Encoding column 5: DICTIONARY_V2[1440]
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Encoding column 8: DIRECT_V2
Encoding column 9: DIRECT_V2
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT_V2
Encoding column 13: DIRECT
Encoding column 14: DIRECT
Encoding column 15: DIRECT
Encoding column 16: DIRECT
Encoding column 17: DIRECT
Encoding column 18: DIRECT
Encoding column 19: DIRECT_V2
Encoding column 20: DIRECT_V2
File length: 32796150 bytes
Padding length: 0 bytes
Padding ratio: 0%
I'm going to look for more details!
@dain From ORC type to Table type
operation | doable |
---|---|
from int to date |
✅ |
from long/bigint to timestamp |
✅ |
operation | doable |
---|---|
from int to date |
✅ |
from long/bigint to timestamp |
❌ |
operation | doable |
---|---|
from int to date |
❌ |
from long/bigint to timestamp |
✅ |
The new code is very strict about type the coercions that were implicitly happening in the old versions. There were many cases where the code perform a wrong/illegal conversion, and cases where expected coercions did not work. The new code restricts the type coercions to and explicit list that we believe is what ORC in Hive supports.
(Before all)
This error occurs since prestosql-308. According to my tests:
Re-produce steps
0. Presto version
I use 307 and 308.
Do the following steps on 307 or 308. (I did these on 307 and I think it doesn't matter on which version)
1. Create table
a_307
The columns explain themself.
2. Insert a record
3. Create other table
b_307
and change the types of columnint_type
andlong_type
4. Select table
Now, we can run select queries.
4.1 Prestosql-307
Both sqls work as expected.
(My timezone is +0800)
a_307
sql:b_307
sql:4.2 Prestosql-308
Only the first sql works but the second sql raise error.
The result of selecting table
a_307
sql:(The same as above)
The ERROR of selecting table
b_307
sql: