Benchmark Vectored IO support for scio-parquet/scio-smb

spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.

Apache License 2.0

Hadoop now supports a vectored read API optimized for seek()-heavy workloads. To use it, a file system must implement the new readVectored method in its positional stream class. Parquet 1.14+ supports vectored IO via the parquet.hadoop.vectored.io.enabled configuration option, and gcs-connector 3.0.1+ implements readVectored for GCS file streams.
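
For reference, a minimal sketch of turning the option on for a benchmark run, assuming hadoop-common and Parquet 1.14+ are on the classpath. Only the option name comes from this issue. The value name vectoredConf is hypothetical, and how the Configuration is handed to a given Scio Parquet or SMB read is left out here.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical name: a Hadoop Configuration dedicated to the vectored IO benchmark run.
val vectoredConf = new Configuration()

// Option name taken from the issue text; assumed to be honored by Parquet 1.14+.
vectoredConf.setBoolean("parquet.hadoop.vectored.io.enabled", true)

// Read the flag back to confirm it is set before passing the configuration to a reader.
val enabled = vectoredConf.getBoolean("parquet.hadoop.vectored.io.enabled", false)
```

A baseline run would use the same configuration with the option set to false, so that the two runs differ only in this flag.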

Currently we're blocked from upgrading to gcs-connector 3.x until Beam does (I ran into a NoSuchMethodError on the Beam side when I tried upgrading it in Scio alone). Once a Beam release ships with gcs-connector 3.x, we should benchmark vectored IO, particularly on Parquet SMB reads.

Benchmark Vectored IO support for scio-parquet/scio-smb #5430