spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0

Support projections in ParquetAvroFileOperations/ParquetAvroSortedBucketIO #5082

Open clairemcginty opened 8 months ago

clairemcginty commented 8 months ago

ParquetAvroFileOperations always overrides the "projection" option to equal the full reflected schema, so you can't supply a projection for a SpecificRecord class:

https://github.com/spotify/scio/blob/110f79593c67c58a2c2465bf2fb340ff4711003f/scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/ParquetAvroFileOperations.java#L175-L176

clairemcginty commented 8 months ago

#5083 provides a workaround for this via the Configuration parameter:

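import com.spotify.scio.parquet.ParquetConfiguration
import org.apache.avro.Schema
import org.apache.beam.sdk.extensions.smb.ParquetAvroSortedBucketIO
import org.apache.parquet.avro.AvroReadSupport

// Subset of TestRecord's schema to read
val projection: Schema = ...
val configuration = ParquetConfiguration.empty()
// Set the requested projection on the Hadoop Configuration
AvroReadSupport.setRequestedProjection(configuration, projection)

// Pass the Configuration through to the SMB read
val read = ParquetAvroSortedBucketIO
  .read(tupleTag, classOf[TestRecord])
  .from(...)
  .withConfiguration(configuration)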
In 0.14 we can add projection as a Builder method to ParquetAvroSortedBucketIO.