googlegenomics / dataflow-java

Google Cloud Dataflow pipelines such as Identity-By-State as well as useful utility classes.
Apache License 2.0

Update ShardedBAMWriting to be compatible with parallel BAM reading. #214

Open deflaux opened 7 years ago

deflaux commented 7 years ago

WriteBAMTransform, which ShardedBAMWriting uses, assumes that the reads it receives are already in order.

Add a group-by and a sort operation ahead of the write so that input reads can be read from multiple BAM shards in parallel without breaking that assumption.
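
To illustrate the idea, here is a minimal sketch in plain Java with no Dataflow dependency. It groups reads by an output shard key and then sorts each group by start position, so arrival order no longer matters. Everything in it is an assumption made for illustration: `SimpleRead`, `SHARD_SIZE`, `shardKey`, and `groupAndSort` are hypothetical names, not classes or methods from this repository. In the actual pipeline the grouping step would presumably be a `GroupByKey` followed by a per-key sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardSortSketch {
    /** Hypothetical stand-in for an aligned read; not the pipeline's real Read type. */
    public static final class SimpleRead {
        public final String reference;
        public final long start;
        public final String name;

        public SimpleRead(String reference, long start, String name) {
            this.reference = reference;
            this.start = start;
            this.name = name;
        }
    }

    /** Assumed width of one output shard, in bases (illustrative value only). */
    public static final long SHARD_SIZE = 1000L;

    /** Builds a key identifying the output shard a read belongs to. */
    public static String shardKey(SimpleRead read) {
        return read.reference + ":" + (read.start / SHARD_SIZE);
    }

    /**
     * Groups reads by shard key, then sorts each group by start position.
     * The input may arrive in any order, e.g. from several readers running in parallel.
     */
    public static Map<String, List<SimpleRead>> groupAndSort(List<SimpleRead> reads) {
        Map<String, List<SimpleRead>> byShard = new HashMap<>();
        for (SimpleRead read : reads) {
            byShard.computeIfAbsent(shardKey(read), k -> new ArrayList<>()).add(read);
        }
        for (List<SimpleRead> shard : byShard.values()) {
            shard.sort(Comparator.comparingLong((SimpleRead r) -> r.start));
        }
        return byShard;
    }

    public static void main(String[] args) {
        // Reads deliberately out of order, as if merged from parallel BAM shard readers.
        List<SimpleRead> reads = Arrays.asList(
            new SimpleRead("chr1", 1500L, "c"),
            new SimpleRead("chr1", 200L, "b"),
            new SimpleRead("chr1", 100L, "a"),
            new SimpleRead("chr1", 1200L, "d"));

        for (Map.Entry<String, List<SimpleRead>> entry : groupAndSort(reads).entrySet()) {
            StringBuilder line = new StringBuilder(entry.getKey()).append(" ->");
            for (SimpleRead read : entry.getValue()) {
                line.append(' ').append(read.name).append('@').append(read.start);
            }
            System.out.println(line);
        }
    }
}
```

Once each shard's reads are sorted, a writer that expects ordered input can consume them one shard at a time, regardless of how the reads were originally read.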