trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Change default compression algorithm to ZSTD for ORC / Parquet #12358

Open arhimondr opened 2 years ago

findepi commented 2 years ago

Requires https://github.com/trinodb/trino/issues/9775

losipiuk commented 2 years ago

I chatted with @findepi about it, and based on a quick investigation it looks like a no-go.

For Delta

We cannot change the default compression to ZSTD because Databricks does not support it right now (TODO: find the exact failures).

For Hive

There have been a bunch of experiments already. I think the most recent attempt was by @electrum in this PR. I skimmed through the error logs, and it looks like ORC with ZSTD is not well supported even on HDP3 (Hive 3.1). There are errors like these:

2022-05-12T23:00:18.6841558Z tests               | Caused by: java.lang.IllegalArgumentException: Unknown compression
2022-05-12T23:00:18.6842101Z tests               |  at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:448)
2022-05-12T23:00:18.6842587Z tests               |  at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:564)
2022-05-12T23:00:18.6843009Z tests               |  at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
2022-05-12T23:00:18.6843446Z tests               |  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
2022-05-12T23:00:18.6843936Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:105)
2022-05-12T23:00:18.6844533Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1647)
2022-05-12T23:00:18.6845178Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1533)
2022-05-12T23:00:18.6845753Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1329)
2022-05-12T23:00:18.6846314Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1513)
2022-05-12T23:00:18.6846850Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1510)
2022-05-12T23:00:18.6847336Z tests               |  at java.security.AccessController.doPrivileged(Native Method)
2022-05-12T23:00:18.6847748Z tests               |  at javax.security.auth.Subject.doAs(Subject.java:422)
2022-05-12T23:00:18.6848228Z tests               |  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
2022-05-12T23:00:18.6848777Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1510)
2022-05-12T23:00:18.6849335Z tests               |  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1329)
2022-05-12T23:00:18.6849816Z tests               |  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-05-12T23:00:18.6850125Z tests               |  ... 3 more
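
To illustrate the failure mode in the trace above: the exception is raised while the reader parses the file tail (`extractPostScript`), that is, before any row data is decoded. A minimal sketch of that kind of check follows, written in Python for brevity. The names `CompressionKind`, `LEGACY_READER_CODECS`, and `check_postscript_compression` are hypothetical and do not reproduce the actual ORC or Hive code; the set of codecs the older reader accepts is an assumption made for illustration.

```python
from enum import Enum, auto


class CompressionKind(Enum):
    """Codec named in an ORC file's postscript (illustrative subset)."""
    NONE = auto()
    ZLIB = auto()
    SNAPPY = auto()
    LZO = auto()
    LZ4 = auto()
    ZSTD = auto()


# Assumption: an older reader build that predates ZSTD support only
# recognizes these codecs.
LEGACY_READER_CODECS = frozenset({
    CompressionKind.NONE,
    CompressionKind.ZLIB,
    CompressionKind.SNAPPY,
    CompressionKind.LZO,
    CompressionKind.LZ4,
})


def check_postscript_compression(kind, supported=LEGACY_READER_CODECS):
    """Reject a file whose postscript names a codec the reader does not know.

    The check runs while the file tail is parsed, so the whole file is
    rejected up front rather than failing partway through a scan.
    """
    if kind not in supported:
        raise ValueError("Unknown compression")
    return kind
```

Under these assumptions, a file written with ZLIB passes the check, while a file written with ZSTD is rejected as soon as it is opened. That would be consistent with the error surfacing in `OrcInputFormat$SplitGenerator`, during split generation.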

For Iceberg

We are already there; ZSTD is already the default.

cc: @arhimondr