spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0

Benchmark Vectored IO support for scio-parquet/scio-smb #5430

Open clairemcginty opened 1 month ago

clairemcginty commented 1 month ago

Hadoop now supports a vectored read API optimized for seek()-heavy workloads; to use it, your file system must implement a new method, readVectored, in its positional stream class. Parquet 1.14+ supports vectored IO via the parquet.hadoop.vectored.io.enabled configuration option, and gcs-connector implements readVectored for GCS file streams in 3.0.1+.
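For reference, a minimal sketch of turning the option on for a benchmark run, assuming the job's Parquet reads are driven by a Hadoop `Configuration`. The object and method names (`VectoredIoConfig`, `withVectoredIo`) are hypothetical; only the property key comes from the description above. It needs hadoop-common on the classpath and does not show how the configuration is passed to scio-parquet or scio-smb.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper, not part of Scio or Parquet.
object VectoredIoConfig {
  // Property key as named in this issue; Parquet 1.14+ is assumed.
  val VectoredIoKey = "parquet.hadoop.vectored.io.enabled"

  // Returns a copy of `base` with vectored IO switched on or off,
  // leaving the caller's Configuration untouched.
  def withVectoredIo(base: Configuration, enabled: Boolean = true): Configuration = {
    val conf = new Configuration(base)
    conf.setBoolean(VectoredIoKey, enabled)
    conf
  }
}
```

Running the same read with `enabled = true` and `enabled = false` would give the two sides of the comparison. The flag should only make a difference when the underlying file system stream implements readVectored, which for GCS means gcs-connector 3.0.1+.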

Currently we're blocked from upgrading to gcs-connector 3.x until Beam does (I ran into a NoSuchMethodError on the Beam side when I tried upgrading only Scio). Once Beam is released with gcs-connector 3.x, we should benchmark vectored IO, particularly on Parquet SMB reads.

clairemcginty commented 1 month ago

Another note: gcs-connector 3.x drops Java 8 support, so we're blocked on #5067. Additionally, it has some breaking API changes, so we need Beam to upgrade first: https://github.com/apache/beam/issues/31896