jirassimok opened 3 years ago
I suppose it's not a bug. Could you try `hive.recursive-directories`?
https://trino.io/docs/current/connector/hive.html?highlight=hive.recursive-directories#hive-configuration-properties
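For reference, this is a catalog-level property. A minimal sketch of enabling it, assuming a Hive catalog configured at `etc/catalog/hive.properties` (the path depends on your deployment):

```properties
# etc/catalog/hive.properties (path is an assumption; adjust to your installation)
hive.recursive-directories=true
```

A restart of the Trino server is needed for catalog property changes to take effect.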
Let me close this issue. Please feel free to reopen if you can't resolve it with the above property.
@ebyhr if `hive.recursive-directories` solves the problem, it would mean our default value for this property is not correct.
-- Trino should read Hive data the same way as Hive does. "Reads the same unless the data was inserted with `UNION ALL`" does not sound like an easy sell to users.
@findepi If I remember correctly, there was a similar discussion about this property. In that past conversation, we applied the same default value as https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L225. This comment is just for sharing the context; it doesn't mean I disagree with changing our default value.
Thanks for the pointer @ebyhr ! Maybe this is a difference in behavior between Hive on MR and, e.g., Hive on Tez? If that's the case, we should revisit our defaults, as Tez seems to be the default (at least in Cloudera/Hortonworks distributions).
@findepi I think so. Also, it seems Hive on MR has been deprecated since 2.x. Maybe we can change our default value. Reference:
Thanks @ebyhr for researching this
@jirassimok can you give an example of what the table directory looks like after this Hive command? Also, what happens for partitioned tables? (I guess partitioned tables still work fine with subdirectories, since they are always at the leaf)
Also, what version of Hive? It would be good to test this on multiple major versions if possible.
One more question is what does Spark do when reading? Does it vary depending on the Spark version?
Here's the table directory for my original example, running on the HDP 3 product test container (with Hive 3.1):
```
test
|-- HIVE_UNION_SUBDIR_1
|-- HIVE_UNION_SUBDIR_1/000000_0
|-- HIVE_UNION_SUBDIR_2
`-- HIVE_UNION_SUBDIR_2/000000_0
```
When I run the default product test container, it uses Hive version 1.2, and this issue doesn't seem to show up (the table directory just has one file, `test/000000_0`).
I can't say anything about Spark, though.
It appears that Spark may suffer similarly:
- https://issues.apache.org/jira/browse/SPARK-28098
- https://stackoverflow.com/questions/46694573/spark-not-able-to-read-hive-table-because-of-1-and-2-sub-folders-in-s3
- http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-How-to-read-Hive-tables-with-Sub-directories-is-this-supported-td32683.html
When data is inserted into a (non-ORC) Hive table using `UNION ALL`, it is not visible to Trino. Other data in the table is still visible.

This occurs for all storage formats besides ORC (I tested Textfile, SequenceFile, Avro, Parquet, RCText, and RCBinary).
The issue occurs whether the table is created from Trino or Hive, and it also occurs for `CREATE TABLE AS` with `UNION ALL`.
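A minimal reproduction sketch in HiveQL (table and column names here are illustrative, not taken from the original report):

```sql
-- Run in Hive (e.g. Hive 3.1 on Tez); the UNION ALL insert writes
-- its results into HIVE_UNION_SUBDIR_* subdirectories of the table path.
CREATE TABLE test (c1 INT) STORED AS TEXTFILE;

INSERT INTO test
SELECT 1
UNION ALL
SELECT 2;
```

With the default `hive.recursive-directories=false`, Trino lists only files directly under the table directory, so a `SELECT * FROM test` in Trino returns none of these rows, while Hive on Tez reads the subdirectories and returns both.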