Dataflow will try to break the file into offset splits of desiredBundleSizeBytes, which we've set to 64 MB, even though binary files should not be split.
Sample repro:
import java.io.InputStream
import java.util.UUID

import com.spotify.scio._
import org.apache.beam.sdk.util.CoderUtils
// AvroCoder, BinaryFileReader and the Avro-generated Account class also need
// imports; their packages depend on the Scio/Beam version in use.

object TestBinaryWrite {
  val coder = AvroCoder.of(classOf[Account]) // used to map avro -> bytes

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    args("method") match {
      // produces a single file of about 280 MB
      case "write" =>
        sc.parallelize(1 to 10_000)
          .flatMap(i => (1 to 250).map(_ * i))
          .map { i =>
            Account
              .newBuilder()
              .setId(i)
              .setAmount(i.toDouble)
              .setName(UUID.randomUUID().toString)
              .setType("checking")
              .build()
          }
          .map(CoderUtils.encodeToByteArray(coder, _))
          .saveAsBinaryFile(args("output"), numShards = 1)
      // read the file we just wrote
      case "read" =>
        sc.binaryFile(args("input"), reader = MyBinaryReader)
    }
    sc.run()
  }

  // A completely meaningless implementation; it doesn't matter for demoing this bug
  case object MyBinaryReader extends BinaryFileReader {
    override type State = Int
    override def start(is: InputStream): Int = 1
    override def readRecord(state: Int, is: InputStream): (Int, Array[Byte]) = {
      val buf = new Array[Byte](1000)
      is.read(buf)
      (state, buf)
    }
  }
}
My guess is that the splittability of the BinaryIO reader implementation is misconfigured, or that there's some way to direct ReadAllViaFileBasedSource not to split the source into offset ranges. We should compare against a sample non-splittable Beam source (I think TFRecordIO is one such example).
Following my own advice, I tried copying TFRecordIO's technique of setting desiredBundleSizeBytes to Long.MAX_VALUE to avoid splitting, and it solved the problem 👍
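For anyone wondering why that works, here is a rough sketch of the arithmetic. It is a simplified model of offset-range splitting, not Beam's actual code: `SplitSketch` and `offsetRanges` are hypothetical names made up for illustration, and the real splitting logic has extra rules (minimum bundle size, handling of a small trailing range) that are ignored here.

```scala
object SplitSketch {
  // Simplified model (an assumption, not Beam's implementation): cut the byte
  // range [0, fileSize) into consecutive ranges of at most
  // desiredBundleSizeBytes each. Each range is (startInclusive, endExclusive).
  def offsetRanges(fileSize: Long, desiredBundleSizeBytes: Long): List[(Long, Long)] = {
    require(fileSize >= 0 && desiredBundleSizeBytes > 0)
    // Length of the range starting at `start`, capped at the end of the file.
    // Taking the min before adding also avoids overflow with Long.MaxValue.
    def step(start: Long): Long = math.min(desiredBundleSizeBytes, fileSize - start)
    Iterator
      .iterate(0L)(start => start + step(start))
      .takeWhile(_ < fileSize)
      .map(start => (start, start + step(start)))
      .toList
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val fileSize = 280 * mb // roughly the size of the file from the repro

    // 64 MB bundles: the file is cut into several offset ranges, and a reader
    // for any range but the first would start in the middle of the stream.
    println(offsetRanges(fileSize, 64 * mb).size)

    // Long.MaxValue: a single range covering the whole file, so no splitting.
    println(offsetRanges(fileSize, Long.MaxValue).size)
  }
}
```

In this model the 64 MB setting cuts the 280 MB file into five ranges, while Long.MaxValue always yields a single range, which is the behavior a non-splittable binary format needs.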
On read, the repro above throws an error in Dataflow.