googlegenomics / dataflow-java

Google Cloud Dataflow pipelines, such as Identity-By-State, as well as useful utility classes.
Apache License 2.0

Identity By State variant streamer fails with write error #213

Open · pbilling opened this issue 7 years ago

pbilling commented 7 years ago

I am trying to apply the IdentityByState pipeline to my variant data, but it reliably fails with a write error (4 out of 4 runs).

Error message:

(dd6b6f2b6ea510df): Workflow failed. Causes: (dd6b6f2b6ea5110a): S07:VariantStreamer/ParDo(RetrieveVariants)+VariantStreamer/ParDo(ConvergeVariantsList)+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/ParDo(BinVariants)+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/GroupByKey/Reify+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/GroupByKey/Write failed.

Command:

$ java -cp target/google-genomics-dataflow-v1-0.8-SNAPSHOT-runnable.jar com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--project=gbsc-gcp-project-mvp \
--variantSetId=17987177733120369382 \
--runner=BlockingDataflowPipelineRunner \
--stagingLocation=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/staging \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--output=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/result/17987177733120369382-n1820-ibs.tsv

I'm retrying right now with the --hasNonVariantSegments flag removed, but this data was generated from gVCF files and processed with the non-variant-segment transformer, so it should contain non-variant segments.

I'm not really sure what this means or how I can go about debugging it. Any ideas are greatly appreciated!

deflaux commented 7 years ago

With --hasNonVariantSegments, the pipeline performs the same merge used in https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation, so that it has all the genotypes for each SNP site in the cohort as input to one of the similarity measures; a sketch of the binning pattern behind that merge follows below.
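
For anyone trying to map the failing stage names (BinVariants, GroupByKey/Write) onto that merge, here is a minimal sketch of the bin-shuffle pattern, assuming the Dataflow 1.x SDK and the com.google.genomics.v1.Variant proto; the class, method, and parameter names are illustrative, not the pipeline's actual code:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.genomics.v1.Variant;

public class BinAndGroupSketch {
  /** Key each record by the fixed-size genomic bins it overlaps, then group. */
  static PCollection<KV<Long, Iterable<Variant>>> binAndGroup(
      PCollection<Variant> variants, final long binSize) {
    return variants
        .apply(ParDo.named("BinVariants").of(new DoFn<Variant, KV<Long, Variant>>() {
          @Override
          public void processElement(ProcessContext c) {
            Variant v = c.element();
            // A record spanning several bins (e.g. a long non-variant
            // segment from a gVCF) is emitted once per bin it overlaps.
            for (long bin = v.getStart() / binSize; bin * binSize < v.getEnd(); bin++) {
              c.output(KV.of(bin, v));
            }
          }
        }))
        // The shuffle behind GroupByKey moves every record for a bin onto a
        // single worker so the merge can see the whole region at once; a bin
        // holding too much data is where an OutOfMemory error would surface.
        .apply(GroupByKey.<Long, Variant>create());
  }
}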

I recommend clicking through to the detailed logs to see if there is any more information there. The most likely issue is an OutOfMemory exception somewhere, because the merge operation needs to co-locate all data for a contiguous genomic region on a single machine. If that is the issue, I recommend trying highmem machines. If you still see an OOM, then try smaller genomic regions by decreasing the value of --binSize.
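
For example, the original command could be retried along these lines; --workerMachineType is the standard Dataflow worker option, and the --binSize value here is only illustrative (decrease from the pipeline's default):

$ java -cp target/google-genomics-dataflow-v1-0.8-SNAPSHOT-runnable.jar com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--project=gbsc-gcp-project-mvp \
--variantSetId=17987177733120369382 \
--runner=BlockingDataflowPipelineRunner \
--workerMachineType=n1-highmem-8 \
--binSize=1000 \
--stagingLocation=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/staging \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--output=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/result/17987177733120369382-n1820-ibs.tsv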

This implementation of Identity-By-State reads from VariantStore. If it were updated to alternatively read from BigQuery, it could instead consume the result of https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation, so that the merge would not need to happen twice.
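
A minimal sketch of that alternative input path, assuming the Dataflow 1.x SDK's BigQueryIO; the table name is hypothetical and stands in for wherever the codelab's merged output was written:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadMergedVariantsSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical table: wherever the variant-transformation codelab wrote
    // its merged (variant + non-variant-segment) records.
    PCollection<TableRow> mergedRows = p.apply(
        BigQueryIO.Read.named("ReadMergedVariants")
            .from("my-project:my_dataset.merged_variants"));

    // Each TableRow would then need to be converted back into a Variant so
    // the IBS similarity computation could consume it without re-running the
    // in-pipeline merge.
    p.run();
  }
}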