Parquet files benefit greatly when similar data values are colocated in the same page, row group, or file:
Low-cardinality, high-repetition columns create the most efficient dictionary encodings, which can drastically reduce total file size and improve predicate pushdown performance
Statistics filtering, which can be applied at the page or row group level, can potentially filter out entire files if the relevant filter columns are colocated properly
This is very hard to do in Scio, or in distributed data processing engines in general, because the data is by default parallelized and unordered. The closest we have right now is SMB, where you can group and sort by up to 2 columns.
However, for non-SMB use cases, we should be able to leverage Beam's ShardingFunction to colocate data efficiently. We could offer a custom implementation of ShardingFunction that could assign shard # based on a hash of user-specified column(s), for example:
case class User(userId: String, date: DateTime, age: Int)
val data: SCollection[User] = ...
data.saveAsTypedParquetFile(
path,
shardBy = ShardBy[User](numShards = 1024, columns = Set(_.userId, _.age))
)
class ShardBy[T](numShards: Int, columns: Set[FilteringColumn]) extends ShardingFunction[T] { ... }
(...basically a low-powered SMB that doesn't cost much on the write side.)
We should also look into z-ordering, which would incur more penalty on write performance but potentially unlock even greater downstream performance gains.
Parquet files benefit greatly when similar data values are colocated in the same page, row group, or file:
This is very hard to do in Scio, or in distributed data processing engines in general, because the data is by default parallelized and unordered. The closest we have right now is SMB, where you can group and sort by up to 2 columns.
However, for non-SMB use cases, we should be able to leverage Beam's ShardingFunction to colocate data efficiently. We could offer a custom implementation of
ShardingFunction
that could assign shard # based on a hash of user-specified column(s), for example:(...basically a low-powered SMB that doesn't cost much on the write side.)
We should also look into z-ordering, which would incur more penalty on write performance but potentially unlock even greater downstream performance gains.