alignment optimization suggestions

yuankunzhu commented 6 years ago

BWA Mem (scattered by read group)

Current setup:

Picard SamToFastq | BWA | Picard MergeBamAlignment

Proposed setup:

adamstruck commented 6 years ago

I believe your missing the merge step from the proposed setup above.

The PCAWG project had success using the biobambam2 suite of tools.

Using those the setup would be:

bamtofastq | BWA | bamsormadup | bammerge

bogdang989 commented 6 years ago

After some testing this is what is recommended:

Extract RG info from header && Picard SamToFastq | BWA | Samblaster | Sambamba View | Sambamba Sort

Example command line looks like this:

/opt/samtools-1.7/samtools  view -H /root_samtools_split/2895820049.bam | 
grep ^@RG > RGINFO && sed -i 's/ /\t/g' RGINFO && >&2 cat RGINFO && 
java -Xms5000m -jar /picard.jar SamToFastq INPUT=/root_samtools_split/2895820049.bam FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true | 
bwa mem -R $(awk '{ print $1"\\t"$2"\\t"$3"\\t"$4"\\t"$5"\\t"$6 }' RGINFO) -K 100000000 -p -v 3 -t 16 -Y /Homo_sapiens_assembly38.fasta - | 
/opt/samblaster/samblaster -i /dev/stdin -o /dev/stdout | 
/opt/sambamba_v0.6.0 view -t 16 -f bam -l 0 -S /dev/stdin | 
/opt/sambamba_v0.6.0 sort -t 16 --natural-sort -m 5GiB --tmpdir ./ -o 2895820049.aligned.unsorted.bam -l 5 /dev/stdin

We returned Picard SamToFastq since samtools bam2fq did not do a great job in conversion - a significant number of the reads were converted as single-end for some reason which might affect the final outcome. Extract RG info from header part is some linux maneuvering to get the RG info from input BAM header (using BWA -R parameter), since in the original setup the RG information was added to the output BAM with Picard MergeBAMAlignment.

For one random pilot sample tested, average single BWA job duration is reduced from ~35min to ~28min and duplicate marking is done on-the-fly, saving nearly 5 hours needed for executing Picard Markduplicates.

@adamstruck Merge step is not included in the pipe, since alignment step is scattered by read group, so merge comes in later as a separate step in the current flow.
Regarding biobambam2 suite, bamsormadup might also be an option instead of the samblaster+sambamba combo in the current proposed setup, definitely could be worth exploring.
As for biobambam2 bam2fastq, I know it was also used in the GDC TCGA harmonization pipeline. I am not sure if this tool has an option to output interleaved fastq files, so it couldn't be just swapped into the pipe and it would probably require some reworking of the workflow configuration. Also, the output should also be checked and compared to the Picard execution. Other than that, it also might be worth exploring as an option.

bogdang989 commented 6 years ago

Inputs with a single read group are aligned twice as long because of using 18 threads. An expression tool should be added to autodetect number of read groups in input BAM and set threads: If there is 1 RG, set 36; Else set 18

Done here https://github.com/bogdang989/kf-alignment-workflow/commit/b4a6228370a750324b200b6054ea60eddc36a9e3

bogdang989 commented 6 years ago

@adamstruck I tested the biobambam bamtofastq and it indeed has the option to output interleaved fastq. What's even better it has the option to restore original quality scores before recalibration, thus fully allowing the use of samtools split instead of picard revertsam. So the proposed setup now would be:

Extract RG info from header && biobambam2 bamtofastq | BWA | Samblaster | Sambamba View | Sambamba Sort

Added this change here: https://github.com/bogdang989/kf-alignment-workflow/commit/0e9d63c78f8dd1eccb83117d93d12d730d89a69b

Still didn't test bamsormadup.

yuankunzhu commented 6 years ago

@adamstruck do you know whether bamsormadup is library aware when handling the duplicate reads marking?

kids-first / kf-alignment-workflow

alignment optimization suggestions #47

Current setup:

Proposed setup: