hive connector io.trino.spi.TrinoException: Unsupported storage format #19018

Open snowangles opened 1 year ago

snowangles commented 1 year ago

Hello,

We have hive tables that use custom input formats and serdes. We noticed that starting with Trino 423 we're no longer able to query these tables.

Query 20230907_171018_00016_mrt64 failed: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
io.trino.spi.TrinoException: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$4(BackgroundHiveSplitLoader.java:497)
 at java.base/java.util.Optional.orElseThrow(Optional.java:403)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:497)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:400)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:314)
 at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
 at io.trino.$gen.Trino_426____20230907_160032_2.run(Unknown Source)
 at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
 at java.base/java.lang.Thread.run(Thread.java:833)

The issue seems to be a recent change to BackgroundHiveSplitLoader.java, where a call to getHiveStorageFormat was introduced that fails when querying a table whose format is not defined in HiveStorageFormat.

We had to modify HiveStorageFormat.java to add our custom serde definitions. This is a really concerning change for us. Why is the Hive connector all of a sudden limited to only those formats defined in HiveStorageFormat?
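
For anyone following along, the failing code path amounts to roughly the following (a sketch reconstructed from the stack trace and the description above, not the exact Trino source; names are approximate):

    // BackgroundHiveSplitLoader.loadPartition (sketch): the table's storage
    // format is matched against the fixed HiveStorageFormat enum, and the
    // lookup comes back empty for anything the enum does not define.
    StorageFormat storageFormat = table.getStorage().getStorageFormat();
    HiveStorageFormat hiveStorageFormat = getHiveStorageFormat(storageFormat)
            .orElseThrow(() -> new TrinoException(HIVE_UNSUPPORTED_FORMAT,
                    "Unsupported storage format: " + tableName + " " + storageFormat));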

The documentation page does not reflect that only certain SequenceFile serdes are supported: https://trino.io/docs/current/connector/hive.html

Assuming this change was done by design, what does the roadmap look like for Hive support in Trino?

pangyifish commented 1 year ago

+1, observed the same issue

s905060 commented 1 year ago

+1 We are having the same issue.

shortland commented 1 year ago

+1. Without major custom changes, this completely breaks and blocks any further Trino upgrades for us.

electrum commented 1 year ago

This is part of the project to decouple Trino from Hadoop and Hive codebases. Can you tell us more about the motivation for using custom input formats or serdes? Would it be feasible for you to convert to a standard format?

realknorke commented 1 year ago

We're using a custom SerDe which is just a wrapper around our own CSV parser. The parser (SFM) is much faster than the default parser shipped with Trino (OpenCSV). Worked well until v423.

snowangles commented 1 year ago

We have custom protobuf and parquet serdes. We are heavily invested in protobufs. For protobufs, we have implemented some custom types for performance reasons. And since our schemas are encoded in protobufs, we have written a parquet serde that can infer the schema from a protobuf.

It'll be a heavy lift to move our infrastructure off of this.

electrum commented 1 year ago

@realknorke Is the CSV input format compatible with Hive OpenCSV? If so, maybe we could replace Trino's CsvDeserializerFactory implementation with the faster version.

electrum commented 1 year ago

@snowangles Thanks for explaining. I'll need to think about this. At the moment, I don't have a good answer for you. You should be able to implement your custom reader in a fork of Trino (or a fork of the Hive connector) by adding your format to HiveStorageFormat and implementing it in HivePageSourceFactory.
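
A rough outline of that fork-based approach (HiveStorageFormat and HivePageSourceFactory are the real classes named above; everything prefixed with My/com.example is hypothetical, and the createPageSource signature is abbreviated here because it varies between Trino versions):

    // 1. Add an enum constant to HiveStorageFormat; the constructor takes the
    //    serde, input format, and output format class names (exact constructor
    //    shape depends on the Trino version):
    MY_FORMAT(
            "com.example.hive.MyCustomSerde",
            "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"),

    // 2. Implement a HivePageSourceFactory that recognizes the format and
    //    produces pages from the custom reader:
    public class MyFormatPageSourceFactory
            implements HivePageSourceFactory
    {
        @Override
        public Optional<ReaderPageSource> createPageSource(/* session, file, schema, columns, ... */)
        {
            // return Optional.empty() for serdes this factory does not handle;
            // otherwise wrap the custom reader in a ConnectorPageSource
            return Optional.empty();
        }
    }
    // The factory also has to be registered in the connector's bindings so
    // split processing can find it (wiring details differ by version).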

realknorke commented 1 year ago

@electrum I don't know for sure whether the SFM CSV parser is 100% compatible with OpenCSV for every edge case (CSV is a format from hell), so it's probably safer for the Trino team to stick with Hadoop's OpenCSV as the default. BUT it would also be good to be able to set a (custom) SerDe implementation as a parameter when creating Hive (HMS-backed) tables. That would let everyone add custom formats and/or parsers/readers without the Trino maintainers having to worry about them (much).

For the matter at hand, the change in v423 is not a show stopper for us, since we can always switch back to the default OpenCSV parser by modifying the SerDe information in the HMS. But that would not be ideal.

Is there a good reason not to allow a custom SerDe as a parameter for CREATE TABLE (apart from the work necessary to implement it)?

Just FYI: Here is how you do that for HMS-based tables in Spark:

    CREATE EXTERNAL TABLE family (id INT, name STRING)
    ROW FORMAT SERDE 'com.ly.spark.serde.SerDeExample'
    STORED AS INPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleInputFormat'
        OUTPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleOutputFormat'
    LOCATION '/tmp/family/';

hashhar commented 1 year ago

> not allow a custom SerDe as parameter

Being able to know whether things will or will not work. E.g. if the serde uses Hadoop classes, it might stop working in the future.

realknorke commented 1 year ago

@hashhar can you please explain? How does allowing a user to specify a custom SerDe (not maintained by the Trino team) affect whether things work or not?

electrum commented 1 year ago

The Hive connector no longer depends on Hive classes (for reasons explained here), so it's not possible to support custom Hive serdes. We also took advantage of that to clean up the code to use the HiveStorageFormat enum in more places, and since enums are not extensible, supporting custom formats would require undoing those changes.
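
To make the extensibility point concrete (a generic Java illustration, not Trino code): an enum closes the set of formats at compile time, while honoring arbitrary serdes would need something like a runtime registry, and every lookup site now written against the enum would have to be rewritten to use it:

    // Enum: the set of formats is fixed when the connector is compiled;
    // adding one means patching and rebuilding the connector.
    enum StorageFormatExample { ORC, PARQUET, SEQUENCEFILE }

    // Registry: entries could be added at runtime through an extension
    // point, but enum-based lookups cannot see them.
    Map<String, ReaderFactory> readers = new ConcurrentHashMap<>();        // ReaderFactory is hypothetical
    readers.put("com.example.hive.MyCustomSerde", new MyReaderFactory());  // hypothetical serde and factory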

realknorke commented 1 year ago

@electrum thank you for the clarification. This means that Trino, on one hand, aims to bring together multiple data sources, while, on the other hand, it restricts access to multiple data formats. A wide variety of connectors, but, compared to Hive, limited functionality when used as a replacement for Hive/Hadoop. :(

That is an unfortunate design decision.

dain commented 1 year ago

I understand you are upset. The decision to drop support for Hive SerDes and Hadoop compression codecs was not made lightly. The Hadoop and Hive codebases are difficult to work with and not well maintained. Additionally, the community has swiftly moved away from these Hadoop and Hive formats to Parquet and ORC, and is pushing further with the switch to Iceberg, Delta Lake, and Hudi. I believe this is a self-reinforcing cycle that is unlikely to change.

Maintaining support for the full breadth of Hadoop/Hive features has been a herculean effort for the past ten years, which we happily undertook because of the vast usage of these systems. However, the usage of these systems has been in decline for years, while the effort to maintain support for them has not shrunk to match; it is actually growing as the Hadoop/Hive codebases become more difficult to work with.

This came to a head as we attempted to add new features like Dynamic Catalogs (#12709). The Hadoop/Hive codebases have critical design flaws that make them incompatible with these new features, so the only reasonable way to add them was to decouple from the Hadoop/Hive codebases. This was a massive effort, and again we happily did it, because we could finally reduce the effort required to maintain Hadoop/Hive support and actually add these amazing new features.

So where do we go from here? For open source, popular, well-maintained formats, we will consider adding official support. We may be able to add interfaces to extend the Hive plugin with new file formats and compression codecs. We have never supported extending the Hive plugin by adding jars to the plugin directory, but a few folks did it anyway, with varying degrees of success. If we do add extension points for this, they will be specific to Trino and will not use Hadoop/Hive APIs (or have them available on the classpath). This means you would need to adapt your custom format to Trino APIs (I assume if you have a custom format, you have programmers). That said, we would need to see a broad community need for this before we would consider adding it (as, again, this is not something we have ever supported).
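
Purely as an illustration of what such a Trino-specific extension point might look like (entirely hypothetical; as the comment above notes, no such interface exists today):

    // Hypothetical extension SPI, not part of Trino today:
    public interface HiveFileFormatProvider
    {
        // whether this provider can decode the given serde / input format pair
        boolean supports(String serde, String inputFormat);

        // build a page source that reads the file into Trino pages, using
        // Trino APIs only (no Hadoop/Hive classes on the classpath)
        ConnectorPageSource createPageSource(TrinoInputFile file, List<HiveColumnHandle> columns);
    }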

realknorke commented 1 year ago

@dain Thank you very much for your thoughts and explanation!