reproio / columnify

Make record oriented data to columnar format.
Apache License 2.0

Upgrade parquet-go to v1.5.3 #59

Closed t2y closed 3 years ago

t2y commented 4 years ago

Description

Why

According to https://github.com/xitongsys/parquet-go/compare/v1.5.2...v1.5.3, v1.5.3 includes several optimizations.

In my environment, "Maximum resident set size" dropped by roughly 14% (136936 to 117272 kbytes) when running the command below. I created the test data with https://github.com/abicky/docker-log-and-fluent-plugin-s3-with-columnify-example.

$ /usr/bin/time -v path/to/columnify \
  -parquetCompressionCodec SNAPPY \
  -parquetPageSize 8192 \
  -parquetRowGroupSize 33554432 \
  -recordType msgpack \
  -schemaType avro \
  -schemaFile fluentd/docker_log.avsc \
  -output /tmp/2020082412_0.parquet \
  2020082412_0.msgpack

v1.5.2

Maximum resident set size (kbytes): 136936

v1.5.3

Maximum resident set size (kbytes): 117272
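
For reference, the relative reduction can be recomputed from the two figures above. This is only a small arithmetic sketch; the helper name `reductionPercent` is made up for illustration and is not part of columnify.

```go
package main

import "fmt"

// reductionPercent returns how much smaller "after" is than "before", in percent.
// Hypothetical helper for illustration only.
func reductionPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Maximum resident set size (kbytes) as reported by /usr/bin/time -v above.
	const v152, v153 = 136936.0, 117272.0
	fmt.Printf("reduction: %.1f%%\n", reductionPercent(v152, v153))
}
```

A single run per version is a small sample, so the exact percentage may vary between runs and machines.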

What

I upgraded the parquet-go library as follows.

$ go get -u github.com/xitongsys/parquet-go
go: github.com/xitongsys/parquet-go upgrade => v1.5.3
go: downloading github.com/xitongsys/parquet-go v1.5.3
$ go mod tidy

Concern

I'm not sure how to verify output compatibility between the old version and the new version. So far I have only confirmed that the JSON output produced by parquet-tool from both Parquet files matches. Does anyone know a better way to verify this?

Reference

codecov-commenter commented 4 years ago

Codecov Report

Merging #59 into master will not change coverage. The diff coverage is n/a.


@@           Coverage Diff           @@
##           master      #59   +/-   ##
=======================================
  Coverage   82.77%   82.77%           
=======================================
  Files          15       15           
  Lines         505      505           
=======================================
  Hits          418      418           
  Misses         65       65           
  Partials       22       22           
Flag         Coverage Δ
#unittests   82.77% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d4d6232...9d60158.

t2y commented 3 years ago

I suspect part of the reason is that the compress package was upgraded as well.

github.com/klauspost/compress v1.10.5

https://github.com/xitongsys/parquet-go/commit/f2077caa4499b8c821e736cd415dbdaa4cbc3852