Hadoop now supports a vectored read API optimized for seek()-heavy workloads; to use it, your file system must implement a new method, readVectored, in its positional stream class. Parquet 1.14+ supports vectored IO via the parquet.hadoop.vectored.io.enabled configuration option, and gcs-connector implements readVectored for GCS file streams in 3.0.1+.
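For reference, a minimal sketch of turning the option on for a benchmark run, assuming hadoop-common and Parquet 1.14+ are on the classpath. Only the option name comes from the description above; how the Configuration gets handed to the Parquet reader (for example through Scio's Parquet IO) is left out and would need checking.

```scala
import org.apache.hadoop.conf.Configuration

// Sketch only: build a Hadoop Configuration with Parquet vectored IO enabled.
val conf = new Configuration()

// Option name as given above. How much this helps depends on the
// readVectored implementation of the underlying file system's stream.
conf.setBoolean("parquet.hadoop.vectored.io.enabled", true)
```

Running the same job with the flag set to false would give the baseline to compare against.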
Currently we're blocked from upgrading to gcs-connector 3.x until Beam does (I ran into a NoSuchMethodError on the Beam side when I tried upgrading it in Scio alone). Once Beam is released with gcs-connector 3.x, we should benchmark vectored IO, particularly on Parquet SMB reads.
Another note: gcs-connector 3.x drops Java 8 support, so we're also blocked on #5067. Additionally, it has some breaking API changes, so we need Beam to upgrade first: https://github.com/apache/beam/issues/31896