Open arthurpassos opened 2 months ago
Maybe this is due to the cardinality of my data, but it still looks weird.
Yes, in my view the issue is not in the application of the compression codec, it's in your data.
If your content is completely random and every field in your file is different, compression algorithms will not be able to find repeated content. In the real world, identifiers repeat and values fall within a certain range.
My proposal for evaluating Parquet compression algorithms is to create a limited list of random values and draw values randomly from it:
Random random = new Random();
List<MyRecord> records = new ArrayList<>();
// Bounded pools of candidate values: rows drawn from them will repeat values,
// which gives both the page encodings and the codec something to work with.
List<Long> randomLongs = generateLongValues(1000);
List<String> randomStrings = generateValues(2000);
for (int i = 0; i < n; i++) {  // n = number of rows to generate
    int int32 = random.nextInt(10000);
    long int64 = randomLongs.get(random.nextInt(1000));
    boolean some_boolean = random.nextBoolean();
    String byte_array = randomStrings.get(random.nextInt(2000));
    String flba = randomStrings.get(random.nextInt(2000));
    MyRecord record = new MyRecord(int32, int64, some_boolean, byte_array, flba);
    records.add(record);
}
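The two helpers above (generateLongValues and generateValues) are not defined in the comment; here is a minimal sketch, assuming they simply build fixed-size pools of random values to draw from:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

// Hypothetical helpers matching the snippet above: each builds a bounded pool
// of random values, so rows generated from it contain repeated values.
static List<Long> generateLongValues(int count) {
    Random random = new Random();
    List<Long> values = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
        values.add(random.nextLong());
    }
    return values;
}

static List<String> generateValues(int count) {
    List<String> values = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
        values.add(UUID.randomUUID().toString());
    }
    return values;
}

Because each column only ever sees 1000 or 2000 distinct values, dictionary encoding and the codec both have redundancy to exploit.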
I have observed the same for some data sets. My understanding is that Parquet applies compression techniques like dictionary encoding (sometimes even delta encoding) even without a compression codec, so for some data sets there is not much more saving left for a codec to find. With JSON, a compressor like zstd can reduce size by up to around 90%, while in my latest tests with Parquet I did not even reach 50%, and going beyond compression level 2 did not provide much improvement (it only required more CPU time). In my case, zstd level 2 gave as good a result as GZIP level 6 with less CPU overhead.
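For concreteness, here is a minimal sketch of selecting the codec and zstd level with parquet-java's example writer. The class name, toy schema, and row counts are made up for illustration, and the zstd level is assumed to be read from the parquet.compression.codec.zstd.level Hadoop configuration key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ZstdLevelExample {
    public static void main(String[] args) throws Exception {
        // Toy schema for illustration only
        MessageType schema = MessageTypeParser.parseMessageType(
                "message record { required int32 int32; required int64 int64; }");

        Configuration conf = new Configuration();
        // Assumption: parquet-java's zstd codec reads its level from this config key
        conf.setInt("parquet.compression.codec.zstd.level", 2);

        SimpleGroupFactory factory = new SimpleGroupFactory(schema);
        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("zstd-level2.parquet"))
                .withConf(conf)
                .withType(schema)
                .withCompressionCodec(CompressionCodecName.ZSTD)
                .withDictionaryEncoding(true) // on by default, shown for clarity
                .build()) {
            // Values repeat within a bounded range, as in the generator snippet above
            for (int i = 0; i < 100_000; i++) {
                writer.write(factory.newGroup()
                        .append("int32", i % 10_000)
                        .append("int64", (long) (i % 1_000)));
            }
        }
    }
}

Swapping CompressionCodecName.ZSTD for GZIP in the builder is all it takes to compare codecs on the same data.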
Yes, Parquet applies multiple encoding strategies to a column's data before trying to compress the resulting binary array: https://parquet.apache.org/docs/file-format/data-pages/encodings/
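To see which encodings and codec were actually applied to each column, and how much the codec saved on top of the already-encoded pages, the footer metadata can be read with parquet-java. This is a sketch with a hypothetical class name and an illustrative output format:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectColumnChunks {
    public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]);
        try (ParquetFileReader reader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    long onDisk = column.getTotalSize();               // encoded and compressed bytes
                    long encoded = column.getTotalUncompressedSize();  // encoded but not compressed
                    double saved = 100.0 * (1.0 - (double) onDisk / encoded);
                    System.out.printf("%s codec=%s encodings=%s saved=%.1f%%%n",
                            column.getPath(), column.getCodec(), column.getEncodings(), saved);
                }
            }
        }
    }
}

The "saved" figure only measures what the codec adds on top of the page encodings, which is why it can be close to 0% when dictionary/RLE encoding has already removed most of the redundancy.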
If I run the same code, but with CompressionCodecName.GZIP, I get a very similar file size.
Inspecting the file shows space saved ~0%.