This PR adds a script to rescue the partially successful results from running the GATK SV single sample workflow. When running this workflow for a genome, frequently all sub-workflows except Manta succeed. When Manta doesn't succeed, none of the files from the successful sub-workflows are copied to the main bucket, and no analyses are logged in Metamist.
The script:
Takes one or more cromwell workflow IDs and the datasets the workflows belong to
Fetches the workflow metadata from the cromwell API
Separates the successful sub-workflows from the unsuccessful ones
Collects all successfully created outputs and copies them into the dataset bucket in the sv_evidence folder
Creates analysis entries for the successful sub-workflows, limited to the 'scramble', 'wham', and 'manta' sub-workflows
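As a rough illustration of the parsing step, here is a minimal sketch, not the code in this PR. It assumes the usual cromwell metadata layout, where `calls` maps `Workflow.CallName` to a list of attempts that each carry an `executionStatus` and an `outputs` dict, and it assumes the last attempt in each list is the one that matters. The function name `summarise_calls` and the example bucket paths are hypothetical.

```python
from typing import Any


def summarise_calls(
    metadata: dict[str, Any],
) -> tuple[dict[str, str], dict[str, dict[str, str]]]:
    """Return (status per call, GCS outputs per call) from workflow metadata.

    Assumption: the last entry in each call's attempt list is the latest attempt.
    """
    statuses: dict[str, str] = {}
    outputs: dict[str, dict[str, str]] = {}
    for full_name, attempts in metadata.get("calls", {}).items():
        # "GatherSampleEvidence.Manta" -> "Manta"
        name = full_name.split(".")[-1]
        latest = attempts[-1]
        status = latest.get("executionStatus", "Unknown")
        statuses[name] = status
        if status == "Done":
            # Keep only string outputs that point at GCS objects
            outputs[name] = {
                key: value
                for key, value in latest.get("outputs", {}).items()
                if isinstance(value, str) and value.startswith("gs://")
            }
        else:
            # Outputs of unfinished or failed calls are not trusted
            outputs[name] = {}
    return statuses, outputs


# Hypothetical, heavily trimmed metadata for illustration only
example = {
    "calls": {
        "GatherSampleEvidence.Scramble": [
            {
                "executionStatus": "Done",
                "outputs": {"vcf": "gs://example-bucket/sample.scramble.vcf.gz"},
            }
        ],
        "GatherSampleEvidence.Manta": [
            {"executionStatus": "Running", "outputs": {}}
        ],
    }
}
statuses, outputs = summarise_calls(example)
```

Sub-workflow calls in real metadata may nest their own metadata rather than exposing outputs at the top level, so the actual script may need to recurse; this sketch ignores that.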
Since only the service account can access the cromwell API token via cpg_utils.cromwell.get_cromwell_oauth_token, I have been testing with a workflow metadata JSON that I downloaded from the cromwell API endpoint swagger page.
With a local JSON from one of the failed workflows, here are the results with --dry-run:
Workflow Status for ID ('test_workflow_2',):
Dataset: my-dataset, Sequencing Group ID: CPGxxxxxx
Scramble: Done
LocalizeReads: Done
CollectCounts: Done
Manta: Running
Whamg: Done
CollectSVEvidence: Done
6 outputs found:
Scramble:
vcf: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Scramble/Scramble/yyy-yyy-yyy-yyy-yyy/call-ScramblePart2/CPGxxxxxx.scramble.vcf.gz
index: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Scramble/Scramble/yyy-yyy-yyy-yyy-yyy/call-ScramblePart2/CPGxxxxxx.scramble.vcf.gz.tbi
LocalizeReads:
CollectCounts:
counts: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectCounts/CPGxxxxxx.counts.tsv.gz
Manta:
Whamg:
vcf: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Whamg/Whamg/yyy-yyy-yyy-yyy-yyy/call-RunWhamgOnCram/CPGxxxxxx.wham.vcf.gz
index: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Whamg/Whamg/yyy-yyy-yyy-yyy-yyy/call-RunWhamgOnCram/CPGxxxxxx.wham.vcf.gz.tbi
CollectSVEvidence:
split_out_index: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sr.txt.gz.tbi
sd_out: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sd.txt.gz
disc_out: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.pe.txt.gz
split_out: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sr.txt.gz
disc_out_index: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.pe.txt.gz.tbi
sd_out_index: gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sd.txt.gz.tbi
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Scramble/Scramble/yyy-yyy-yyy-yyy-yyy/call-ScramblePart2/CPGxxxxxx.scramble.vcf.gz to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.scramble.vcf.gz
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Scramble/Scramble/yyy-yyy-yyy-yyy-yyy/call-ScramblePart2/CPGxxxxxx.scramble.vcf.gz.tbi to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.scramble.vcf.gz.tbi
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectCounts/CPGxxxxxx.counts.tsv.gz to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.counts.tsv.gz
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Whamg/Whamg/yyy-yyy-yyy-yyy-yyy/call-RunWhamgOnCram/CPGxxxxxx.wham.vcf.gz to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.wham.vcf.gz
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-Whamg/Whamg/yyy-yyy-yyy-yyy-yyy/call-RunWhamgOnCram/CPGxxxxxx.wham.vcf.gz.tbi to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.wham.vcf.gz.tbi
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sr.txt.gz.tbi to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.sr.txt.gz.tbi
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sd.txt.gz to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.sd.txt.gz
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.pe.txt.gz to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.pe.txt.gz
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sr.txt.gz to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.sr.txt.gz
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.pe.txt.gz.tbi to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.pe.txt.gz.tbi
DRY RUN: Would have copied gs://cpg-seqr-main-tmp/cromwell/GatherSampleEvidence/xxx-xxx-xxx-xxx-xxx/call-CollectSVEvidence/CollectSVEvidence/yyy-yyy-yyy-yyy-yyy/call-RunCollectSVEvidence/CPGxxxxxx.sd.txt.gz.tbi to gs://cpg-my-dataset-main/sv_evidence/CPGxxxxxx.sd.txt.gz.tbi
No manta outputs found for CPGxxxxxx.
Dataset: my-dataset
Sequencing Group ID: CPGxxxxxx, Would create: 2 SV analyses
In this case, the scramble and whamg sub-workflows succeeded, as did the CollectCounts and CollectSVEvidence sub-workflows. So we copy all of these files across to the dataset's main bucket under the sv_evidence/ prefix, and we create two SV analyses: one for the scramble result and one for the whamg result.
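The destination naming visible in the dry-run output can be sketched as follows. This is an illustration under assumptions rather than the script itself: the bucket pattern `gs://cpg-<dataset>-main` and the flat `sv_evidence/` prefix are inferred from the log lines above, and the helper names `plan_copies` and `copy_outputs` are hypothetical. The real copy is deliberately left out.

```python
import posixpath


def plan_copies(
    dataset: str, source_paths: list[str], prefix: str = "sv_evidence"
) -> list[tuple[str, str]]:
    """Pair each source path with its destination in the dataset main bucket."""
    bucket = f"gs://cpg-{dataset}-main"  # assumed naming, inferred from the log
    return [
        (src, f"{bucket}/{prefix}/{posixpath.basename(src)}")
        for src in source_paths
    ]


def copy_outputs(
    dataset: str, source_paths: list[str], dry_run: bool = True
) -> list[str]:
    """Return the log messages for a dry run; real copying is not implemented here."""
    messages = []
    for src, dst in plan_copies(dataset, source_paths):
        if dry_run:
            messages.append(f"DRY RUN: Would have copied {src} to {dst}")
        else:
            # Placeholder: the actual copy mechanism is outside this sketch
            raise NotImplementedError("real copy is not part of this sketch")
    return messages


messages = copy_outputs(
    "my-dataset",
    ["gs://example-tmp/cromwell/call-Whamg/CPGxxxxxx.wham.vcf.gz"],
)
```

Because destinations are built from the file basename only, two sub-workflows that emit identically named files would collide; the outputs shown above all have distinct names.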