SvABA Pipeline Adoption

gongyixiao commented 3 years ago

SvABA (Broad pipeline): https://dockstore.org/containers/registry.hub.docker.com/weischenfeldt/svaba_workflow https://github.com/walaj/svaba

[x] Validate the usability, editability and reproducibility Dockerfile and the docker image
[x] Gather all the necessary reference files
[x] Develop a test dataset which can successfully run for unit test
[x] Convert the Dockstore.cwl into nextflow friendly scripts
[x] Use real data to test it
[x] Integrate it into TEMPO
[x] Performance evaluation and optimization
[x] Result check

stevekm commented 3 years ago

I started working on this here; /juno/work/ci/kellys5/projects/tempo-SNV-dev/svaba-dev Got a demo SvABA workflow running in Nextflow.

I did not use the Dockerfile that they had in that link since it included other stuff that we probably do not need, but I used it as the basis for a new "cleaner" Dockerfile at that location on Juno that only contains the latest version of SvABA. Also included Dockerhub -> Singularity conversion with the Dockerhub entry pushed here; https://hub.docker.com/r/mskcc/svaba

Also I skipped the CWL they included as well and just went straight to the svaba run command invocation in Nextflow, since it was not clear to me what added value they had in their CWL and .py wrapper scripts

So far, its running on the demo Tempo tumor/normal file with 32 threads on LSF via Singularity container

gongyixiao commented 3 years ago

Would you mind to start a new branch and push your Dockerfile there?

stevekm commented 3 years ago

the SvABA test run finished, it took about 5hrs on 32 cores, used a max of 14GB memory, LSF job 51066477

the command was;

svaba run     -t "tumor.bam"     -n "normal.bam"     -G "human_g1k_v37_decoy.fasta"     -p "32"     -z

SvABA Version: 1.1.3 (4d7b57); this is the latest one, newer than the one from the paper

input tumor and normal .bam files were about 260GB and 130GB, output files produced;

2.7G    no_id.alignments.txt.gz
120M    no_id.bps.txt.gz
1.7G    no_id.contigs.bam
21M     no_id.discordant.txt.gz
73M     no_id.log
50M     no_id.svaba.germline.indel.vcf.gz
640K    no_id.svaba.germline.sv.vcf.gz
256K    no_id.svaba.somatic.indel.vcf.gz
64K     no_id.svaba.somatic.sv.vcf.gz
66M     no_id.svaba.unfiltered.germline.indel.vcf.gz
56M     no_id.svaba.unfiltered.germline.sv.vcf.gz
4.2M    no_id.svaba.unfiltered.somatic.indel.vcf.gz
6.0M    no_id.svaba.unfiltered.somatic.sv.vcf.gz
4.8G    total

excerpt from the SvABA console log;

-----------------------------------------------------------
---  Running svaba SV and indel detection on 32 threads ---
---    (inspect *.log for real-time progress updates)   ---
-----------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
--- Loaded non-read data. Starting detection pipeline
...vcf - reading in the breakpoints file
...vcf sizeof empty VCFEntryPair 64 bytes
...read in 1,140,104 indels and 524,380 SVs
...vcf - deduplicating 524,380 events
...dedupe at 0 of 524,380
...dedupe at 250,000 of 524,380
...dedupe at 500,000 of 524,380
...hashing 1,140,104 indels for dedupe
...done deduping indels
...vcf - deduplicated down to 421,953 break pairs
Done with SvABA

stats on the bam files used;

$ samtools flagstat tumor.bam
2479210921 + 0 in total (QC-passed reads + QC-failed reads)
10561445 + 0 secondary
0 + 0 supplementary
619619675 + 0 duplicates
2476520908 + 0 mapped (99.89% : N/A)
2468649476 + 0 paired in sequencing
1234324738 + 0 read1
1234324738 + 0 read2
2415292220 + 0 properly paired (97.84% : N/A)
2464074034 + 0 with itself and mate mapped
1885429 + 0 singletons (0.08% : N/A)
30457898 + 0 with mate mapped to a different chr
24625037 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat normal.bam
1170519487 + 0 in total (QC-passed reads + QC-failed reads)
3845353 + 0 secondary
0 + 0 supplementary
272347508 + 0 duplicates
1169335058 + 0 mapped (99.90% : N/A)
1166674134 + 0 paired in sequencing
583337067 + 0 read1
583337067 + 0 read2
1150083750 + 0 properly paired (98.58% : N/A)
1164497582 + 0 with itself and mate mapped
992123 + 0 singletons (0.09% : N/A)
6578048 + 0 with mate mapped to a different chr
4398841 + 0 with mate mapped to a different chr (mapQ>=5)

The flagstat output suggests there are a lot of duplicates in the input .bam files I used, and the SvABA seemed to spend a lot of time on the deduplication steps, so I am wondering if maybe it would run faster if the bam files were deduplicated ahead of time.

I will do another test run with the version of SvABA used in the paper (4a0606).

gongyixiao commented 3 years ago

Thank you for this very informative comment.

Some thoughts on this:

We need a way to be able to disable the germline calls because of the consent issue.
Is it really using 32 cores when running? What's the % cpu reported in the nextflow's trace.txt file? It's not realistic to give 32 cores when we are running many samples on juno, since most of the node only have 40 cores. Of course the resource allocation needs more investigation and can be done later.
From the source code I'm reading in their github (https://github.com/walaj/svaba/blob/d12cf224f7a488b913eabbcf54a215e17238032c/src/svaba/vcf.cpp#L415-L419), it’s doing his own kind of dedup, which works together with marked duplicates in the bam. I think it's worth to investigate if dedup based on the marked duplicate in the bam can reduce the run time significantly, and at the same time, make sure we get the same result with or without dedup at first. What I will do is:

process SvABA {

input: ...
output: ...
script:
"""
date
samtools view -F 1024 -i tumor.bam -o tumor.dedup.bam
date
samtools view -F 1024 -i normal.bam -o normal.dedup.bam
date
svaba run -t tumor.dedup.bam -n normal.dedup.....
date
"""
}

stevekm commented 3 years ago

Nextflow is reporting CPU usage of 2300% so ~23 cores, but I wasnt able to check in htop on the node so I am not sure how accurate that is.

Can germline calls just be filtered from the output afterwards? I will double check the SvABA args and see if theres anything about that in there

I can set up a test run with the samtool filtering you have there. Do we know if these flags are applied to all the bam files that we would run through SvABA or do we need to do a pre-processing step to get the flags applied?

gongyixiao commented 3 years ago

I think it's better to disable germline calls completely if we can.

And the flag should be universal to all the bams we use here. So no other preprocessing is needed. But please double check 1024 is the correct flag.

在 2020年10月30日，上午4:42，Stephen Kelly notifications@github.com 写道：

Nextflow is reporting CPU usage of 2300% so ~23 cores, but I wasnt able to check in htop on the node so I am not sure how accurate that is.

Can germline calls just be filtered from the output afterwards? I will double check the SvABA args and see if theres anything about that in there

I can set up a test run with the samtool filtering you have there. Do we know if these flags are applied to all the bam files that we would run through SvABA or do we need to do a pre-processing step to get the flags applied?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

stevekm commented 3 years ago

I tried adding the samtools view steps as described, but it caused the run time for the job to go over the LSF default time limit (5hr); this is even longer than SvABA alone took to run on the bam without filtering. So maybe trying to pre-filter and dedup the bams will just add more time to execution. Not sure if there are other steps in the pipeline where we are doing this already?

stevekm commented 3 years ago

Re-ran the same test with the Dockerhub container provided with the paper using SvABA version 134 (4a0606), copied the results at this location;

/juno/work/ci/kellys5/projects/tempo-SNV-dev/tempo-dev-svaba1/snv/SvABA/output-test2-134-4a0606/82b0ccd449a6e8d52934e57b8189f2

snippet from the console log for SvABA;

...vcf - reading in the breakpoints file
...vcf sizeof empty VCFEntryPair 64 bytes
...read in 1,139,579 indels and 485,638 SVs
...vcf - deduplicating 485,638 events
...dedupe at 0 of 485,638
...dedupe at 250,000 of 485,638
...hashing 1,139,579 indels for dedupe
...done deduping indels
...vcf - deduplicated down to 397,310 break pairs
Done with SvABA

So it looks like by default it is doing slightly different filtering of bam alignments, with a smaller number included in this one vs. the newer version

output files;

2.8G    no_id.alignments.txt.gz
105M    no_id.bps.txt.gz
1.7G    no_id.contigs.bam
17M     no_id.discordant.txt.gz
60M     no_id.log
47M     no_id.svaba.germline.indel.vcf.gz
1.4M    no_id.svaba.germline.indel.vcf.gz.tbi
480K    no_id.svaba.germline.sv.vcf.gz
64K     no_id.svaba.germline.sv.vcf.gz.tbi
240K    no_id.svaba.somatic.indel.vcf.gz
48K     no_id.svaba.somatic.indel.vcf.gz.tbi
48K     no_id.svaba.somatic.sv.vcf.gz
16K     no_id.svaba.somatic.sv.vcf.gz.tbi
62M     no_id.svaba.unfiltered.germline.indel.vcf.gz
1.5M    no_id.svaba.unfiltered.germline.indel.vcf.gz.tbi
49M     no_id.svaba.unfiltered.germline.sv.vcf.gz
1008K   no_id.svaba.unfiltered.germline.sv.vcf.gz.tbi
4.0M    no_id.svaba.unfiltered.somatic.indel.vcf.gz
384K    no_id.svaba.unfiltered.somatic.indel.vcf.gz.tbi
4.9M    no_id.svaba.unfiltered.somatic.sv.vcf.gz
400K    no_id.svaba.unfiltered.somatic.sv.vcf.gz.tbi
4.7G    total

normal bam: /juno/work/tempo/wgs/Results/v1.3.x/bams/s_C_000676_N002_d/s_C_000676_N002_d.bam
tumor bam: /juno/work/tempo/wgs/Results/v1.3.x/bams/s_C_000676_M002_d/s_C_000676_M002_d.bam

mskcc / tempo

SvABA Pipeline Adoption #849