Open ArvinZheng opened 4 years ago
Also wanted to share another finding here: in Hive 2.3.4, when `orc.force.positional.evolution` is set to true, all columns are index mapped by their ordinal except nested structs, which matches current Presto behavior. I'm going to start another conversation with the ORC folks to see whether this is intentional or whether they also want to address it, and will ping here if I have any updates.
@ArvinZheng Do you have any updates from the ORC folks? I'm also looking for a solution, but I'm still at the first step of choosing which data format to use for our project. Does ORC support schema evolution with Presto or not? If I set `hive.orc.use-column-names` to true, or force `orc.force.positional.evolution` to true, will ORC accept an evolution in the schema?
Thanks in advance.
@Sarrouna, yes, the issue has been fixed in https://issues.apache.org/jira/browse/ORC-626: a new config item, `orc.force.positional.evolution.level`, was added to determine how many levels of nested types are read by index.
BTW, `orc.force.positional.evolution` is a config item of Apache ORC, while Presto maintains its own ORC readers, so adding it to your Presto config wouldn't work.
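To make the "how many levels are read by index" idea concrete, here is a toy sketch in Python. It is not ORC's or Presto's actual reader code: the function `map_fields`, the `(name, children)` schema representation, and the way levels are counted (level 1 meaning top-level columns only) are all assumptions made for illustration.

```python
# Toy model only: map_fields and the schema shape are hypothetical,
# not part of ORC or Presto.
# A schema is a list of (field_name, children) pairs, where children is
# another schema for a struct, or None for a primitive field.

def map_fields(table, file, positional_levels, depth=0):
    """Map each table field to a file field.

    Fields at a depth below positional_levels are matched by ordinal;
    deeper fields are matched by name. Unmatched fields map to None.
    """
    use_position = depth < positional_levels
    file_by_name = {entry[0]: entry for entry in file}
    mapping = {}
    for i, (name, children) in enumerate(table):
        if use_position:
            match = file[i] if i < len(file) else None
        else:
            match = file_by_name.get(name)
        if match is None:
            mapping[name] = None
            continue
        file_name, file_children = match
        if children is not None and file_children is not None:
            nested = map_fields(children, file_children,
                                positional_levels, depth + 1)
            mapping[name] = (file_name, nested)
        else:
            mapping[name] = (file_name, None)
    return mapping


# Table and file agree on the top-level column but not on the struct field.
table_schema = [("cost", [("raw_cost", None)])]
file_schema = [("cost", [("raw_cost_micros", None)])]

# One positional level: the nested field is matched by name and is lost.
one_level = map_fields(table_schema, file_schema, positional_levels=1)
# Two positional levels: the nested field is matched by ordinal.
two_levels = map_fields(table_schema, file_schema, positional_levels=2)
print(one_level)
print(two_levels)
```

With one positional level the nested `raw_cost` field has no match, which mirrors the Hive 2.3.4 behavior described earlier in the thread; with two levels it resolves to `raw_cost_micros`.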
We recently upgraded 2 of our Presto clusters to 0.208 and 317 and found that, after the upgrade, Presto changed the default schema evolution for struct to `name based` instead of `positional`, and does not provide an option for positional mapping.
For example, a query that reads `cost.raw_cost` runs fine in 0.180, and the data for `cost.raw_cost` is returned as expected.
After moving to 0.208 or 317, the same query always returns null for `cost.raw_cost`.
Note: the field name of `cost.raw_cost` is different in the ORC file - `raw_cost_micros`. If we rename `raw_cost` to `raw_cost_micros` to match the ORC metadata, we are able to get the correct data.
I noticed that the change was introduced in https://github.com/prestodb/presto/pull/11123/files.
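The difference between the two mapping modes can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Presto's reader code: the helper names `read_struct_by_name` and `read_struct_by_position` and the sample value are made up.

```python
# Illustration only: these helpers and the sample value are hypothetical.

def read_struct_by_name(table_fields, file_fields, file_values):
    """Resolve each table field by looking up the same name in the file."""
    by_name = dict(zip(file_fields, file_values))
    return {field: by_name.get(field) for field in table_fields}


def read_struct_by_position(table_fields, file_fields, file_values):
    """Resolve each table field by its ordinal, ignoring file field names."""
    return {
        field: (file_values[i] if i < len(file_values) else None)
        for i, field in enumerate(table_fields)
    }


# The Hive table declares cost.raw_cost, but the ORC file stores the
# same field as raw_cost_micros.
table_fields = ["raw_cost"]
file_fields = ["raw_cost_micros"]
file_values = [1250000]  # made-up sample value

by_name = read_struct_by_name(table_fields, file_fields, file_values)
by_position = read_struct_by_position(table_fields, file_fields, file_values)
print(by_name)      # {'raw_cost': None}
print(by_position)  # {'raw_cost': 1250000}
```

Name-based lookup finds no `raw_cost` in the file and yields null, while positional lookup returns the stored value regardless of the name mismatch.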
IMO, when we are talking about default behaviors of Hive, the version of Hive should always be involved. IIRC the current default schema evolution in Hive is name based, and `orc.force.positional.evolution` is provided to force positional mapping.
But the default in Presto is positional for all other columns and name based for struct, which is not aligned with any Hive version. I understand that aligning default behaviors with Hive is not easy and #1558 has been created to track that, but before we are able to make a decision and implement #1558, should we think about addressing the current issue?
Updating the column names in Hive to match ORC is not that easy for us: we have multiple Hive columns whose names do not match the ORC file, and we also have many downstream consumers that already subscribe to this table and rely on the current Hive table definition.
What I can think of now is
`hive.orc.use-column-names` is set to `true`
Neither is ideal, but maybe option 2 is safer, as it won't break the current default behavior (people may have been relying on it to change their ORC structs).
@dain, @findepi feel free to comment, cc: @martint