losipiuk opened this issue 2 years ago (status: Open)
This doesn’t make sense to me, since GZIP is neither the best nor the fastest. We choose it by default for compatibility. Users are free to choose ZSTD or LZ4 for CSV files if they want.
We want to have a DEFAULT compression setting where we pick ZSTD if possible. This is a curated default choice for each format, based on compatibility.
Using LZ4 seems like a poor choice since it increases network usage which can be a bottleneck. If users want that, they can choose it directly.
Unless there’s a compelling benchmark result to show that LZ4 is useful for ORC or Parquet, we shouldn’t consider having both logical options.
> Therefore we propose to extend the list of codecs with two more "logical" ones:
>
> - BEST
I believe ZSTD is "best" for ORC. I also think it's "best" for CSV, but we don't choose it there for compatibility reasons. The "BEST" option name doesn't convey this complexity.
> We want to have a DEFAULT compression setting where we pick ZSTD if possible. This is a curated default choice for each format, based on compatibility.
"DEFAULT" is an awesome choice to add.
Let's make DEFAULT the default option and let's make it mean ZSTD where compatibility allows us to.
So IIUC you both (@findepi and @electrum) are opting for just a single option (name of the option, BEST/DEFAULT, to be decided). I am fine with that, yet it is not what @dain suggested during an offline conversation. @dain, can you also speak up?
> So IIUC you both (@findepi and @electrum) are opting for just a single option (name of the option, BEST/DEFAULT, to be decided).
Yes, I am for a single new option, called "DEFAULT". (While I could see arguments for "BEST", I don't think it's a better name for what we want to achieve.)
New finding: the newest Hive on EMR does not support ZSTD for ORC currently. @findepi, @electrum, do you think that makes it a no-go to use it as the default compression for ORC? If so, maybe we should still have BEST map to ZSTD for ORC, but not make it the default, just an opt-in. WDYT?
I was proposing that we have HIGH and FAST. HIGH compression would be ZSTD (or GZ if unavailable) and FAST would be LZ4 (or Snappy if unavailable... never, ever, LZO). If we believe that ZSTD is also the fastest, then it might make sense to only have one option, but that makes deciding the fallback compression (GZ or SNAPPY) complex... then again, it looks like only Avro has limited compression support.
BTW it looks like AVRO has supported ZSTD since 1.9.
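To make that proposal concrete, here is a minimal sketch of the fallback logic described above; the class, enum, and parameter names are illustrative assumptions, not Trino's actual API:

```java
// Minimal sketch of the HIGH/FAST fallback idea above; all names here
// are illustrative assumptions, not Trino's actual API.
final class LogicalCompression
{
    enum Logical { HIGH, FAST }

    private LogicalCompression() {}

    static String resolve(Logical logical, boolean formatSupportsZstd, boolean formatSupportsLz4)
    {
        switch (logical) {
            case HIGH:
                // strongest codec the format can handle, GZIP as fallback
                return formatSupportsZstd ? "ZSTD" : "GZIP";
            case FAST:
                // fastest codec the format can handle; never, ever, LZO
                return formatSupportsLz4 ? "LZ4" : "SNAPPY";
        }
        throw new AssertionError("unreachable");
    }
}
```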
@dain There is still the question of compatibility with Hive, which I am still not sure we care about. I would say we should not have a default that would leave Hive users unable to read tables written by Trino.
I checked on the newest EMR (with Hive 3.1.2), and for Parquet we are good. For all compression codecs supported by Trino, Hive can read Trino-written tables.
For ORC, Hive cannot read ZSTD (the others are fine).
For Avro, the situation is more complex.
It looks like Trino currently always writes Avro either uncompressed or using ZLIB. AvroRecordWriter expects `avro.output.codec` to be set in the JobConf, and we never set that.
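For illustration, a minimal sketch of what setting that codec would look like, assuming the Hadoop mapred writer path described above stays in place; whether Trino should wire it this way (and which codec to pick) is exactly the open question:

```java
import org.apache.hadoop.mapred.JobConf;

// Sketch only: AvroRecordWriter reads its codec from the "avro.output.codec"
// JobConf entry, which Trino never sets today, so output ends up
// uncompressed or deflate (ZLIB).
final class AvroOutputCodecExample
{
    private AvroOutputCodecExample() {}

    static JobConf withZstandard(JobConf conf)
    {
        // "zstandard" is the codec name Avro's CodecFactory accepts since 1.9
        conf.set("avro.output.codec", "zstandard");
        return conf;
    }
}
```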
If we want to use different codecs, we need to change the code, and there may be dragons. And good find on ORC. Indeed, Avro currently supports most of what Trino does (all but LZ4; see [here](https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/CodecFactory.java#L135-L140)). Edit: we are also missing GZIP there; I thought DEFLATE was the same thing, pardon my ignorance :/
Text formats (tested using CSV) can be read by Hive for all compression codecs.
cc: @electrum @findepi
> New finding: the newest Hive on EMR does not support ZSTD for ORC currently. @findepi, @electrum, do you think that makes it a no-go to use it as the default compression for ORC?
It seems so. We cannot just start writing files that people cannot read.
> I checked on the newest EMR (with Hive 3.1.2), and for Parquet we are good. For all compression codecs supported by Trino, Hive can read Trino-written tables.
Let's improve Parquet first, then. This is awesome!
@pettyjamesm can AWS make EMR Hive support ZSTD?
It's possible, but they would have to put it on their roadmap and prioritize it. If it's easy enough to add, it might be better to add support to Hive itself, no?
@dain, @electrum I wanted to move this forward in spare cycles. While there are some concerns about the different formats, we can gradually settle on some (at least Parquet looks good already).
Yet I need a decision on whether we go with two options (HIGH/FAST?) or a single one (DEFAULT).
I am leaning toward a single DEFAULT. Are you both good with that?
JFYI, in Apache Spark the default compression codec for Parquet has been SNAPPY since version 2.0. Also, the LZ4 compression scheme has been deprecated by parquet-format in favor of LZ4_RAW. cc: @sopel39
> FAST is supposed to pick the fastest codec.
What does FAST mean? Fast for compression/decompression, or fast for later SELECT queries? A codec might be fast (e.g. UNCOMPRESSED) but produce such big files that later scans are IO-bound and thus slow.
I assume HIGH means "great compression, not necessarily super CPU-efficient" and FAST means "reasonable compression, but super CPU-efficient"; not an exact science.
Let's just go with DEFAULT (whatever Trino developers think is best) so we have fewer decisions to make :)
It does seem better to just set a default per file format; it's not easy for users to choose whether FAST or BEST is more appropriate for them. FAST can easily make IO the bottleneck, which could be "slow" for query execution. I would vote for ZSTD as the first choice for the default, with SNAPPY as the second if ZSTD can't be chosen due to compatibility concerns.
Just having DEFAULT sounds good to me.
Currently, the Hive connector allows selecting a specific compression codec to be used on the write path. The selection is done via the `hive.compression-codec` config property (or the `compression_codec` session property), and it can be one of the following values: NONE, SNAPPY, LZ4, ZSTD, GZIP.

The selected codec is applied regardless of the written file format (the same codec will be used for a table using CSV and a table using ORC). This is not optimal. The current default (GZIP) is good for writing CSV files, but for ORC files it would be better to use ZSTD. It is probably not common for a single user to write both CSV and ORC files (though not unheard of), but surely not every user uses the same file format, and the default configuration should be selected in such a way that it covers as many use cases as possible.
Therefore we propose to extend the list of codecs with two more "logical" ones:

- BEST
- FAST

Those would map to different concrete codecs depending on the file format: BEST is supposed to pick the strongest compression codec supported by the file format in use, and FAST is supposed to pick the fastest codec.
Having the above also allows for more flexibility if new, better codecs become supported in the future.
Proposed mapping of file format to codec (to be discussed and tested):
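As one hypothetical strawman, the per-format picks below are assumptions drawn from the points raised in this thread, not a settled decision:

```java
import java.util.EnumMap;
import java.util.Map;

// Strawman only: the per-format picks below are assumptions based on the
// discussion in this issue, to be validated before anything is changed.
final class LogicalCodecMapping
{
    enum Format { ORC, PARQUET, AVRO, TEXTFILE }

    private LogicalCodecMapping() {}

    static final Map<Format, String> BEST = new EnumMap<>(Map.of(
            Format.ORC, "ZSTD",        // Hive-on-EMR readability is a concern (see thread)
            Format.PARQUET, "ZSTD",
            Format.AVRO, "ZSTD",       // needs Avro >= 1.9 plus writer changes
            Format.TEXTFILE, "GZIP")); // kept for compatibility

    static final Map<Format, String> FAST = new EnumMap<>(Map.of(
            Format.ORC, "LZ4",
            Format.PARQUET, "SNAPPY",  // parquet-format deprecated LZ4 in favor of LZ4_RAW
            Format.AVRO, "SNAPPY",
            Format.TEXTFILE, "SNAPPY"));
}
```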
The next step is to change the default value of `hive.compression-codec` to BEST.

cc: @arhimondr, @dain, @electrum, @findepi