Schema evolution by changing/widening column datatype is supported on HIVE, Trino is unable to read data from these tables. This is the case for both Transactional and External tables.
Note: ** Create Table syntax are in Hive SQL
The following example is for Hive Transactional table
jdbc:hive2://localhost:10001/> create table schema_evol_test_latest.a_string_b_int_non_par_non_buck (a string, b int);
jdbc:hive2://localhost:10001/> insert into schema_evol_test_latest.a_string_b_int_non_par_non_buck values ("a", 1), ("b", 2);
trino> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck;
Create Table
CREATE TABLE hive.schema_evol_test_latest.a_string_b_int_non_par_non_buck (
a varchar,
b integer
)
WITH (
format = 'ORC',
transactional = true
)
(1 row)
jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck;
+------------------------------------+------------------------------------+
| a_string_b_int_non_par_non_buck.a | a_string_b_int_non_par_non_buck.b |
+------------------------------------+------------------------------------+
| a | 1 |
| b | 2 |
+------------------------------------+------------------------------------+
trino> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck;
a | b
---+---
a | 1
b | 2
(2 rows)
Query 20210908_201307_01042_xfgng failed: Error opening Hive split abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000 (offset=0, length=712): Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000]
io.trino.spi.TrinoException: Error opening Hive split abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000 (offset=0, length=712): Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000]
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:454)
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:182)
at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:286)
at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:175)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:64)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308)
at io.trino.operator.Driver.processInternal(Driver.java:387)
at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
at io.trino.operator.Driver.tryWithLock(Driver.java:683)
at io.trino.operator.Driver.processFor(Driver.java:284)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
at io.trino.$gen.Trino_361____20210908_171302_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.trino.orc.OrcCorruptionException: Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000]
at io.trino.orc.reader.ReaderUtils.invalidStreamType(ReaderUtils.java:45)
at io.trino.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:32)
at io.trino.orc.reader.LongColumnReader.(LongColumnReader.java:89)
at io.trino.orc.reader.ColumnReaders.createColumnReader(ColumnReaders.java:59)
at io.trino.orc.OrcRecordReader.createColumnReaders(OrcRecordReader.java:579)
at io.trino.orc.OrcRecordReader.(OrcRecordReader.java:244)
at io.trino.orc.OrcReader.createRecordReader(OrcReader.java:308)
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:372)
... 17 more
jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck COMPACT 'major';
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck;
a | b
---+---
a | 1
b | 2
(2 rows)
> * same issue is encountered even with partitioned tables.
> * post successful `major` compaction, the reading of data is as expected.
> * if the column data type is changed from `int to bigint`, it's working fine on Trino without compaction.
> The following example is for Hive External table
jdbc:hive2://localhost:10001/> create external table a_string_b_int_non_par_non_buck_ext (a string, b float) stored as parquet;
jdbc:hive2://localhost:10001/> insert into a_string_b_int_non_par_non_buck_ext values ("a", 1.1), ("b", 2.2);
jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext;
+----------------------------------------+----------------------------------------+
| a_string_b_int_non_par_non_buck_ext.a | a_string_b_int_non_par_non_buck_ext.b |
+----------------------------------------+----------------------------------------+
| a | 1.1 |
| b | 2.2 |
+----------------------------------------+----------------------------------------+
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext;
a | b
---+-----
a | 1.1
b | 2.2
(2 rows)
jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext CHANGE COLUMN bb double;
jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; ---> some issue on hive as well, implicitly extra values getting added post decimal 8th decimal point
+----------------------------------------+----------------------------------------+
| a_string_b_int_non_par_non_buck_ext.a | a_string_b_int_non_par_non_buck_ext.b |
+----------------------------------------+----------------------------------------+
| a | 1.100000023841858 |
| b | 2.200000047683716 |
+----------------------------------------+----------------------------------------+
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext;
Query 20210908_202515_01110_xfgng failed: io.trino.spi.type.DoubleType
java.lang.UnsupportedOperationException: io.trino.spi.type.DoubleType
at io.trino.spi.type.AbstractType.writeLong(AbstractType.java:91)
at io.trino.parquet.reader.FloatColumnReader.readValue(FloatColumnReader.java:34)
at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:195)
at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:251)
at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:194)
at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:183)
at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:370)
at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:444)
at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:427)
at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:224)
at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:396)
at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:375)
at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:282)
at io.trino.spi.Page.getLoadedPage(Page.java:279)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314)
at io.trino.operator.Driver.processInternal(Driver.java:387)
at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
at io.trino.operator.Driver.tryWithLock(Driver.java:683)
at io.trino.operator.Driver.processFor(Driver.java:284)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
at io.trino.$gen.Trino_361____20210908_171253_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Summary
Schema evolution by changing/widening column datatype is supported on HIVE, Trino is unable to read data from these tables. This is the case for both Transactional and External tables.
jdbc:hive2://localhost:10001/> insert into schema_evol_test_latest.a_string_b_int_non_par_non_buck values ("a", 1), ("b", 2);
jdbc:hive2://localhost:10001/> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck; +----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE TABLE
schema_evol_test_latest.a_string_b_int_non_par_non_buck
( | |a
string, | |b
int) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' | | LOCATION | | 'abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck' | | TBLPROPERTIES ( | | 'bucketing_version'='2', | | 'transactional'='true', | | 'transactional_properties'='default', | | 'transient_lastDdlTime'='1631131803') | +----------------------------------------------------+trino> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck; Create Table
CREATE TABLE hive.schema_evol_test_latest.a_string_b_int_non_par_non_buck ( a varchar,
b integer
)
WITH (
format = 'ORC',
transactional = true
)
(1 row)
jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck; +------------------------------------+------------------------------------+ | a_string_b_int_non_par_non_buck.a | a_string_b_int_non_par_non_buck.b | +------------------------------------+------------------------------------+ | a | 1 | | b | 2 | +------------------------------------+------------------------------------+
trino> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck; a | b ---+--- a | 1 b | 2 (2 rows)
jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck CHANGE COLUMN
b
b
string; --> valid schema changejdbc:hive2://localhost:10001/> show create table schema_evol_test_latest.a_string_b_int_non_par_non_buck; +----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE TABLE
schema_evol_test_latest.a_string_b_int_non_par_non_buck
( | |a
string, | |b
string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' | | LOCATION | | 'abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck' | | TBLPROPERTIES ( | | 'bucketing_version'='2', | | 'last_modified_by'='anonymous', | | 'last_modified_time'='1631131955', | | 'transactional'='true', | | 'transactional_properties'='default', | | 'transient_lastDdlTime'='1631131955') | +----------------------------------------------------+jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck; --> successful +------------------------------------+------------------------------------+ | a_string_b_int_non_par_non_buck.a | a_string_b_int_non_par_non_buck.b | +------------------------------------+------------------------------------+ | a | 1 | | b | 2 | +------------------------------------+------------------------------------+
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck;
Query 20210908_201307_01042_xfgng, FAILED, 1 node https://bifrostx.myntra.com/ui/query.html?20210908_201307_01042_xfgng Splits: 1 total, 0 done (0.00%) CPU Time: 0.3s total, 0 rows/s, 0B/s, 88% active Per Node: 0.4 parallelism, 0 rows/s, 0B/s Parallelism: 0.4 Peak Memory: 0B 0.73 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20210908_201307_01042_xfgng failed: Error opening Hive split abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000 (offset=0, length=712): Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000] io.trino.spi.TrinoException: Error opening Hive split abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000 (offset=0, length=712): Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000] at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:454) at io.trino.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:182) at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:286) at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:175) at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49) at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:64) at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308) at io.trino.operator.Driver.processInternal(Driver.java:387) at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291) at io.trino.operator.Driver.tryWithLock(Driver.java:683) at io.trino.operator.Driver.processFor(Driver.java:284) at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076) at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163) at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484) at io.trino.$gen.Trino_361____20210908_171302_2.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: io.trino.orc.OrcCorruptionException: Malformed ORC file. Cannot read SQL type 'varchar' from ORC stream '.row.b' of type INT with attributes {} [abfs://bifrostx-hive-data@gen2hivebifros.dfs.core.windows.net/qa-data/schema_evol_test_latest.db/a_string_b_int_non_par_non_buck/delta_0000001_0000001_0000/bucket_00000] at io.trino.orc.reader.ReaderUtils.invalidStreamType(ReaderUtils.java:45) at io.trino.orc.reader.ReaderUtils.verifyStreamType(ReaderUtils.java:32) at io.trino.orc.reader.LongColumnReader.(LongColumnReader.java:89)
at io.trino.orc.reader.ColumnReaders.createColumnReader(ColumnReaders.java:59)
at io.trino.orc.OrcRecordReader.createColumnReaders(OrcRecordReader.java:579)
at io.trino.orc.OrcRecordReader.(OrcRecordReader.java:244)
at io.trino.orc.OrcReader.createRecordReader(OrcReader.java:308)
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:372)
... 17 more
jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck COMPACT 'major';
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck; a | b ---+--- a | 1 b | 2 (2 rows)
jdbc:hive2://localhost:10001/> create external table a_string_b_int_non_par_non_buck_ext (a string, b float) stored as parquet;
jdbc:hive2://localhost:10001/> insert into a_string_b_int_non_par_non_buck_ext values ("a", 1.1), ("b", 2.2);
jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; +----------------------------------------+----------------------------------------+ | a_string_b_int_non_par_non_buck_ext.a | a_string_b_int_non_par_non_buck_ext.b | +----------------------------------------+----------------------------------------+ | a | 1.1 | | b | 2.2 | +----------------------------------------+----------------------------------------+
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; a | b
---+----- a | 1.1 b | 2.2 (2 rows)
jdbc:hive2://localhost:10001/> ALTER TABLE schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext CHANGE COLUMN
b
b
double;jdbc:hive2://localhost:10001/> select * from schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext; ---> some issue on hive as well, implicitly extra values getting added post decimal 8th decimal point +----------------------------------------+----------------------------------------+ | a_string_b_int_non_par_non_buck_ext.a | a_string_b_int_non_par_non_buck_ext.b | +----------------------------------------+----------------------------------------+ | a | 1.100000023841858 | | b | 2.200000047683716 | +----------------------------------------+----------------------------------------+
trino> select * from hiveqa.schema_evol_test_latest.a_string_b_int_non_par_non_buck_ext;
Query 20210908_202515_01110_xfgng, FAILED, 1 node https://bifrostx.myntra.com/ui/query.html?20210908_202515_01110_xfgng Splits: 1 total, 0 done (0.00%) CPU Time: 0.0s total, 0 rows/s, 0B/s, 0% active Per Node: 0.0 parallelism, 0 rows/s, 0B/s Parallelism: 0.0 Peak Memory: 0B 0.81 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20210908_202515_01110_xfgng failed: io.trino.spi.type.DoubleType java.lang.UnsupportedOperationException: io.trino.spi.type.DoubleType at io.trino.spi.type.AbstractType.writeLong(AbstractType.java:91) at io.trino.parquet.reader.FloatColumnReader.readValue(FloatColumnReader.java:34) at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:195) at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:251) at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:194) at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:183) at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:370) at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:444) at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:427) at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:224) at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:396) at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:375) at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:282) at io.trino.spi.Page.getLoadedPage(Page.java:279) at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314) at io.trino.operator.Driver.processInternal(Driver.java:387) at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291) at io.trino.operator.Driver.tryWithLock(Driver.java:683) at io.trino.operator.Driver.processFor(Driver.java:284) at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076) at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163) at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484) at io.trino.$gen.Trino_361____20210908_171253_2.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829)