trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.38k stars 2.98k forks source link

Trino fails to read hive tables which have undergone valid schema evolution #9162

Open vinay-kl opened 3 years ago

vinay-kl commented 3 years ago

Summary

Schema evolution by changing/widening column datatype is supported on HIVE, Trino is unable to read data from these tables. This is the case for both Transactional and External tables.

Note: ** Create Table syntax are in Hive SQL

The following example is for Hive Transactional table


jdbc:hive2://localhost:10001/> create table schema_evol_test_latest.a_string_b_int_non_par_non_buck (a string, b int);

jdbc:hive2://localhost:10001/> insert into schema_evol_test_latest.a_string_b_int_non_par_non_buck values ("a", 1), ("b", 2);

jdbc:hive2://localhost:10001/> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck; +----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck( | | a string, | | b int) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' | | LOCATION | | 'abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck' | | TBLPROPERTIES ( | | 'bucketing_version'='2', | | 'transactional'='true', | | 'transactional_properties'='default', | | 'transient_lastDdlTime'='1631131803') | +----------------------------------------------------+

trino> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck; Create Table

CREATE TABLE hive.schema_evol_test_latest.a_string_b_int_non_par_non_buck ( a varchar,
b integer
)
WITH (
format = 'ORC',
transactional = true
)
(1 row)

jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck; +------------------------------------+------------------------------------+ | a_string_b_int_non_par_non_buck.a | a_string_b_int_non_par_non_buck.b | +------------------------------------+------------------------------------+ | a | 1 | | b | 2 | +------------------------------------+------------------------------------+

trino> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck; a | b ---+--- a | 1 b | 2 (2 rows)

jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck CHANGE COLUMN b b string; --> valid schema change

jdbc:hive2://localhost:10001/> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck; +----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck( | | a string, | | b string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' | | LOCATION | | 'abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck' | | TBLPROPERTIES ( | | 'bucketing_version'='2', | | 'last_modified_by'='anonymous', | | 'last_modified_time'='1631131955', | | 'transactional'='true', | | 'transactional_properties'='default', | | 'transient_lastDdlTime'='1631131955') | +----------------------------------------------------+

jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck; --> successful +------------------------------------+------------------------------------+ | a_string_b_int_non_par_non_buck.a | a_string_b_int_non_par_non_buck.b | +------------------------------------+------------------------------------+ | a | 1 | | b | 2 | +------------------------------------+------------------------------------+

trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck;

Query 20210908_201307_01042_xfgng, FAILED, 1 node https://bifrostx.myntra.com/ui/query.html?20210908_201307_01042_xfgng Splits: 1 total, 0 done (0.00%) CPU Time: 0.3s total, 0 rows/s, 0B/s, 88% active Per Node: 0.4 parallelism, 0 rows/s, 0B/s Parallelism: 0.4 Peak Memory: 0B 0.73 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20210908_201307_01042_xfgng failed: Error opening Hive split abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000 (offset=0, length=712): Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000] io.trino.spi.TrinoException: Error opening Hive split abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000 (offset=0, length=712): Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000] at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:454) at io.trino.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:182) at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:286) at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:175) at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49) at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:64) at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308) at io.trino.operator.Driver.processInternal(Driver.java:387) at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291) at io.trino.operator.Driver.tryWithLock(Driver.java:683) at io.trino.operator.Driver.processFor(Driver.java:284) at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076) at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163) at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484) at io.trino.$gen.Trino_361____20210908_171302_2.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: io.trino.orc.OrcCorruptionException: Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000] at io.trino.orc.reader.ReaderUtils.invalidStreamType(ReaderUtils.java:45) at io.trino.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:32) at io.trino.orc.reader.LongColumnReader.(LongColumnReader.java:89) at io.trino.orc.reader.ColumnReaders.createColumnReader(ColumnReaders.java:59) at io.trino.orc.OrcRecordReader.createColumnReaders(OrcRecordReader.java:579) at io.trino.orc.OrcRecordReader.(OrcRecordReader.java:244) at io.trino.orc.OrcReader.createRecordReader(OrcReader.java:308) at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:372) ... 17 more

jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck COMPACT 'major';

trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck; a | b ---+--- a | 1 b | 2 (2 rows)


> * same issue is encountered even with partitioned tables.
> * post successful `major` compaction, the reading of data is as expected.
> * if the column data type is changed from `int to bigint`, it's working fine on Trino without compaction.

> The following example is for Hive External table

jdbc:hive2://localhost:10001/> create external table a_string_b_int_non_par_non_buck_ext (a string, b float) stored as parquet;

jdbc:hive2://localhost:10001/> insert into a_string_b_int_non_par_non_buck_ext values ("a", 1.1), ("b", 2.2);

jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; +----------------------------------------+----------------------------------------+ | a_string_b_int_non_par_non_buck_ext.a | a_string_b_int_non_par_non_buck_ext.b | +----------------------------------------+----------------------------------------+ | a | 1.1 | | b | 2.2 | +----------------------------------------+----------------------------------------+

trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; a | b
---+----- a | 1.1 b | 2.2 (2 rows)

jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext CHANGE COLUMN b b double;

jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; ---> some issue on hive as well, implicitly extra values getting added post decimal 8th decimal point +----------------------------------------+----------------------------------------+ | a_string_b_int_non_par_non_buck_ext.a | a_string_b_int_non_par_non_buck_ext.b | +----------------------------------------+----------------------------------------+ | a | 1.100000023841858 | | b | 2.200000047683716 | +----------------------------------------+----------------------------------------+

trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext;

Query 20210908_202515_01110_xfgng, FAILED, 1 node https://bifrostx.myntra.com/ui/query.html?20210908_202515_01110_xfgng Splits: 1 total, 0 done (0.00%) CPU Time: 0.0s total, 0 rows/s, 0B/s, 0% active Per Node: 0.0 parallelism, 0 rows/s, 0B/s Parallelism: 0.0 Peak Memory: 0B 0.81 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20210908_202515_01110_xfgng failed: io.trino.spi.type.DoubleType java.lang.UnsupportedOperationException: io.trino.spi.type.DoubleType at io.trino.spi.type.AbstractType.writeLong(AbstractType.java:91) at io.trino.parquet.reader.FloatColumnReader.readValue(FloatColumnReader.java:34) at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:195) at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:251) at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:194) at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:183) at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:370) at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:444) at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:427) at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:224) at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:396) at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:375) at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:282) at io.trino.spi.Page.getLoadedPage(Page.java:279) at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314) at io.trino.operator.Driver.processInternal(Driver.java:387) at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291) at io.trino.operator.Driver.tryWithLock(Driver.java:683) at io.trino.operator.Driver.processFor(Driver.java:284) at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076) at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163) at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484) at io.trino.$gen.Trino_361____20210908_171253_2.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829)

electrum commented 3 years ago

There is an existing test for this which should be modified to pass when this is fixed.