go-sif / sif

Sif is a framework for fast, predictable, general-purpose distributed computing in the map/reduce paradigm.
Apache License 2.0

Columnar partition data format #29

Closed Ghnuberath closed 3 years ago

Ghnuberath commented 3 years ago

This PR switches Sif's internal partition data structure from row-oriented to columnar. Beyond the usual benefits of a columnar layout, the change substantially simplifies partition allocation and serialization logic. In particular, a generated protobuf struct for partitions is now used directly to store data, shortening what was an internal struct -> proto -> serialize -> compress chain to proto -> serialize -> compress.
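For readers unfamiliar with the distinction, here is a minimal sketch of row-oriented versus columnar storage. The `Row` and `ColumnarPartition` types and both functions are hypothetical illustrations, not Sif's actual types or its generated protobuf struct.

```go
package main

import "fmt"

// Row is one record in a row-oriented partition (hypothetical, not a Sif type).
type Row struct {
	ID    int32
	Score float64
}

// ColumnarPartition stores each column contiguously (hypothetical, not a Sif type).
type ColumnarPartition struct {
	IDs    []int32
	Scores []float64
}

// ToColumnar converts row-oriented data into the columnar layout.
func ToColumnar(rows []Row) ColumnarPartition {
	p := ColumnarPartition{
		IDs:    make([]int32, 0, len(rows)),
		Scores: make([]float64, 0, len(rows)),
	}
	for _, r := range rows {
		p.IDs = append(p.IDs, r.ID)
		p.Scores = append(p.Scores, r.Score)
	}
	return p
}

// SumScores reads only the Scores column; the IDs column is never touched.
func SumScores(p ColumnarPartition) float64 {
	var total float64
	for _, s := range p.Scores {
		total += s
	}
	return total
}

func main() {
	p := ToColumnar([]Row{{ID: 1, Score: 0.5}, {ID: 2, Score: 1.5}, {ID: 3, Score: 2.0}})
	fmt.Println(len(p.IDs), SumScores(p))
}
```

An operation that needs a single column, such as `SumScores`, walks one contiguous slice instead of stepping over unrelated fields in every row.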

The next step after this PR will be per-column compression rather than whole-partition compression. This should reduce runtime memory use and compute cost, since only the columns a stage actually uses will need to be decompressed; the rest can pass through still compressed.
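A rough sketch of what per-column compression could look like follows. It assumes gzip from the Go standard library purely for illustration; the PR does not say which codec Sif uses. The `CompressedPartition` type and all function names are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// CompressedPartition maps a column name to its independently compressed
// bytes (hypothetical type, not Sif's).
type CompressedPartition map[string][]byte

// compress gzips a single column's raw bytes.
func compress(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress for a single column.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// CompressColumns compresses every column separately.
func CompressColumns(cols map[string][]byte) (CompressedPartition, error) {
	out := make(CompressedPartition, len(cols))
	for name, raw := range cols {
		c, err := compress(raw)
		if err != nil {
			return nil, err
		}
		out[name] = c
	}
	return out, nil
}

// DecompressNeeded decompresses only the named columns. Every other column
// stays compressed inside p and can be forwarded as-is.
func DecompressNeeded(p CompressedPartition, needed ...string) (map[string][]byte, error) {
	out := make(map[string][]byte, len(needed))
	for _, name := range needed {
		c, ok := p[name]
		if !ok {
			return nil, fmt.Errorf("unknown column %q", name)
		}
		raw, err := decompress(c)
		if err != nil {
			return nil, err
		}
		out[name] = raw
	}
	return out, nil
}

func main() {
	p, err := CompressColumns(map[string][]byte{
		"id":   []byte("1234"),
		"name": []byte("abcd"),
	})
	if err != nil {
		panic(err)
	}
	cols, err := DecompressNeeded(p, "id")
	if err != nil {
		panic(err)
	}
	fmt.Printf("decompressed %d of %d columns\n", len(cols), len(p))
}
```

Because each column is compressed on its own, a stage that reads only `id` never pays to inflate `name`.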

Since column deletion is now trivial, the Repack concept and its associated operation have been deprecated. Columns and their data will be deleted on demand.
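To illustrate why deletion becomes cheap, here is a small sketch assuming a partition that keeps one byte slice per column. The `Partition` type and `DropColumn` method are hypothetical, not Sif's API.

```go
package main

import "fmt"

// Partition holds one byte slice per column (hypothetical, not a Sif type).
type Partition struct {
	Columns map[string][]byte
}

// DropColumn removes a column and its data in a single step and reports
// whether the column existed. No other column is read or rewritten.
func (p *Partition) DropColumn(name string) bool {
	if _, ok := p.Columns[name]; !ok {
		return false
	}
	delete(p.Columns, name)
	return true
}

func main() {
	p := &Partition{Columns: map[string][]byte{
		"id":   {1, 2, 3},
		"name": {4, 5, 6},
	}}
	fmt.Println(p.DropColumn("name"), len(p.Columns))
}
```

In a row-oriented layout, reclaiming the same space would mean rewriting every row, which is the kind of work a columnar layout avoids.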