trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Allow selecting "logical" compression codec in Hive connector #12580

Open losipiuk opened 2 years ago

losipiuk commented 2 years ago

Currently, the Hive connector allows selecting a specific compression codec to be used on the write path. The selection is done via the hive.compression-codec config property (or the compression_codec session property) and it can be one of the following values:

  • NONE
  • SNAPPY
  • LZ4
  • ZSTD
  • GZIP
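(For illustration, a minimal sketch of overriding the codec for a single session via Trino's Java `Session` builder, as is commonly done in tests - equivalent to `SET SESSION hive.compression_codec = 'ZSTD'`; the catalog name `hive` is an assumption:)

```java
import io.trino.Session;

class CompressionCodecExample
{
    // Equivalent to: SET SESSION hive.compression_codec = 'ZSTD'
    // (catalog name "hive" is assumed; use your catalog's name)
    static Session withZstdCodec(Session base)
    {
        return Session.builder(base)
                .setCatalogSessionProperty("hive", "compression_codec", "ZSTD")
                .build();
    }
}
```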

The selected codec is applied regardless of the written file format (the same codec will be used for a table using CSV and for one using ORC). This is not optimal. The current default (GZIP) is good for writing CSV files, but for ORC files it would be better to use ZSTD. It is probably not common for a single user to write both CSV and ORC files (though not unheard of), but surely not every user uses the same file format. And the default configuration should be selected in such a way that it covers as many use cases as possible.

Therefore we propose to extend the list of codecs with two more "logical" ones:

  • BEST
  • FAST

Those would map to different compression codecs depending on the file format: BEST is supposed to pick the strongest compression codec supported by the file format in use, and FAST the fastest one.

Having the above also allows for more flexibility if new, better codecs become supported in the future.

Proposed mapping of file format to codec (to be discussed and tested):

| File format  | BEST codec | FAST codec | Notes |
| ------------ | ---------- | ---------- | ----- |
| ORC          | ZSTD       | LZ4        | depends on https://github.com/trinodb/trino/issues/9775 |
| PARQUET      | ZSTD       | LZ4        | |
| AVRO         | SNAPPY     | SNAPPY     | referred to as an "optional" codec here |
| RCBINARY     | GZIP       | GZIP       | rarely used - keep old default |
| RCTEXT       | GZIP       | GZIP       | rarely used - keep old default |
| SEQUENCEFILE | GZIP       | GZIP       | rarely used - keep old default |
| JSON         | GZIP       | GZIP       | rarely used - keep old default |
| TEXTFILE     | GZIP       | GZIP       | .gz is natural for text files |
| CSV          | GZIP       | GZIP       | .gz is natural for text files |

The next step would be to change the default value of hive.compression-codec to BEST.
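As a rough sketch (all identifiers below are illustrative, not actual Trino code), the BEST mapping from the table could look like:

```java
// Illustrative sketch of the table above; enum and method names are made up.
enum Format { ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV }
enum Codec { ZSTD, LZ4, SNAPPY, GZIP }

class LogicalCodecs
{
    static Codec bestCodecFor(Format format)
    {
        switch (format) {
            case ORC:     // pending https://github.com/trinodb/trino/issues/9775
            case PARQUET:
                return Codec.ZSTD;
            case AVRO:
                return Codec.SNAPPY;
            default:      // rarely used or text formats - keep the old GZIP default
                return Codec.GZIP;
        }
    }
}
```

A `fastCodecFor` would be identical except that ORC and PARQUET would map to LZ4.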

cc: @arhimondr, @dain, @electrum, @findepi

electrum commented 2 years ago

This doesn’t make sense to me, since GZIP is neither the best nor the fastest. We choose it by default for compatibility. Users are free to choose ZSTD or LZ4 for CSV files if they want.

We want to have a DEFAULT compression setting where we pick ZSTD if possible. This is a curated default choice for each format, based on compatibility.

Using LZ4 seems like a poor choice since it increases network usage, which can be a bottleneck. If users want that, they can choose it directly.

Unless there’s a compelling benchmark result to show that LZ4 is useful for ORC or Parquet, we shouldn’t consider having both logical options.

findepi commented 2 years ago

Therefore we propose to extend the list of codecs with two more "logical" ones:

  • BEST

I believe ZSTD is "best" for ORC. I also think it's "best" for CSV, but we don't choose it for compatibility reasons. The "BEST" option name doesn't convey this complexity.

We want to have a DEFAULT compression setting where we pick ZSTD if possible. This is a curated default choice for each format, based on compatibility.

"DEFAULT" is an awesome choice to add.

Let's make DEFAULT the default option and let's make it mean ZSTD where compatibility allows us to.

losipiuk commented 2 years ago

So IIUC you both (@findepi and @electrum) are opting for just a single option (with the name, BEST or DEFAULT, to be decided). I am fine with that, yet it is not what @dain suggested during an offline conversation. @dain, can you also speak up?

findepi commented 2 years ago

So IIUC you both (@findepi and @electrum) are opting for just a single option (with the name, BEST or DEFAULT, to be decided).

Yes, I am for a single new option, called "DEFAULT". (While I could see arguments for "BEST", I think it's not a better name for what we want to achieve.)

losipiuk commented 2 years ago

New finding: the newest Hive on EMR does not currently support ZSTD for ORC. @findepi, @electrum, do you think that makes it a no-go to use it as the default compression for ORC? If so, maybe we should still have the BEST compression option with ZSTD for ORC, but not make it the default, just an opt-in. WDYT?

dain commented 2 years ago

I was proposing that we have HIGH and FAST. HIGH compression would be ZSTD (or GZ if unavailable) and FAST would be LZ4 (or Snappy if unavailable... never, ever, LZO). If we believe that ZSTD is also the fastest, then it might make sense to only have one option, but that makes deciding the fallback compression (GZ or SNAPPY) complex... then again, it looks like only AVRO has limited compression support.
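Roughly, as a sketch (hypothetical helper names, reusing the Codec enum from the sketch above - not Trino code):

```java
// Hypothetical sketch of the HIGH/FAST fallback rule described above.
static Codec resolveHigh(boolean formatSupportsZstd)
{
    return formatSupportsZstd ? Codec.ZSTD : Codec.GZIP;
}

static Codec resolveFast(boolean formatSupportsLz4)
{
    return formatSupportsLz4 ? Codec.LZ4 : Codec.SNAPPY; // never, ever LZO
}
```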

dain commented 2 years ago

BTW it looks like AVRO has supported ZSTD since 1.9.

losipiuk commented 2 years ago

@dain There is still the question of compatibility with Hive - which I am still not sure we care about. I would say we should not have a default that leaves Hive users unable to read tables written by Trino.

I checked on the newest EMR (with Hive 3.1.2), and for Parquet we are good: Hive can read Trino-written tables for all compression codecs supported by Trino.

For ORC, Hive cannot read ZSTD (the other codecs are fine).

For Avro the situation is more complex. It looks like Trino currently always writes AVRO either uncompressed or using ZLIB. AvroRecordWriter expects avro.output.codec to be set in the JobConf - and we never set that. If we want to use different codecs we need to change the code, and there may be dragons. And good find on Avro ZSTD support. Indeed, AVRO currently supports most of the codecs Trino does (all but LZ4 - see [CodecFactory](https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/CodecFactory.java#L135-L140); Edit: we are also missing GZIP there - I thought DEFLATE was the same thing, pardon my ignorance :/).
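For context, the hook point would look roughly like this (`avro.output.codec` is the actual property Hive's Avro writer reads; the wrapper method is hypothetical):

```java
import org.apache.hadoop.mapred.JobConf;

// Hive's AvroRecordWriter picks the codec from "avro.output.codec";
// Trino never sets it today, hence the uncompressed/ZLIB output.
static JobConf withAvroOutputCodec(JobConf conf, String codecName)
{
    conf.set("avro.output.codec", codecName); // e.g. "deflate", "snappy", "zstandard"
    return conf;
}
```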

Text formats (tested using CSV) can be read by Hive for all compression codecs.

cc: @electrum @findepi

findepi commented 2 years ago

New finding: the newest Hive on EMR does not currently support ZSTD for ORC. @findepi, @electrum, do you think that makes it a no-go to use it as the default compression for ORC?

It seems so. We cannot just start writing files that people cannot read.

findepi commented 2 years ago

I checked on the newest EMR (with Hive 3.1.2), and for Parquet we are good: Hive can read Trino-written tables for all compression codecs supported by Trino.

Let's improve for Parquet first - then this is awesome!

pettyjamesm commented 2 years ago

@pettyjamesm can AWS make EMR Hive support ZSTD?

It's possible, but they would have to add it to their roadmap to prioritize adding support for it. If it's easy enough to add, it might be better to add support to Hive itself, no?

losipiuk commented 2 years ago

@dain, @electrum I wanted to move this forward in spare cycles. While there are some concerns about the different formats, we can settle them gradually (at least Parquet looks good already). Yet I need a decision on whether we go with two options (HIGH/FAST?) or a single one (DEFAULT). I am leaning toward a single DEFAULT. Are you both good with that?

raunaqmorarka commented 2 years ago

JFYI: in Apache Spark, the default compression codec for Parquet has been SNAPPY since version 2.0. Also, the LZ4 compression scheme has been deprecated by parquet-format in favor of LZ4_RAW. cc: @sopel39

sopel39 commented 2 years ago

FAST is supposed to pick the fastest codec.

What does FAST mean? Fast for compression/decompression, or fast for later SELECT queries? A codec might be fast (e.g. UNCOMPRESSED) but produce such big files that later scans are IO-bound and thus slow.

losipiuk commented 2 years ago

I assume HIGH means "great compression, not necessarily super CPU-efficient" and FAST means "reasonable compression, but super CPU-efficient" - not an exact science. Let's just go with DEFAULT (whatever Trino developers think is best) so we have fewer decisions to make :)

raunaqmorarka commented 2 years ago

It does seem better to just set a default per file format; it's not easy for users to choose whether FAST or BEST is more appropriate for them. FAST can easily make IO the bottleneck, which could be "slow" for query execution. I would vote for ZSTD as the first choice for the default, with SNAPPY as the second if ZSTD can't be chosen due to compatibility concerns.

electrum commented 2 years ago

Just having DEFAULT sounds good to me.