Open yuankunzhu opened 6 years ago
I believe your missing the merge step from the proposed setup above.
The PCAWG project had success using the biobambam2 suite of tools.
Using those the setup would be:
bamtofastq | BWA | bamsormadup | bammerge
After some testing this is what is recommended:
Extract RG info from header && Picard SamToFastq | BWA | Samblaster | Sambamba View | Sambamba Sort
Example command line looks like this:
/opt/samtools-1.7/samtools view -H /root_samtools_split/2895820049.bam |
grep ^@RG > RGINFO && sed -i 's/ /\t/g' RGINFO && >&2 cat RGINFO &&
java -Xms5000m -jar /picard.jar SamToFastq INPUT=/root_samtools_split/2895820049.bam FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true |
bwa mem -R $(awk '{ print $1"\\t"$2"\\t"$3"\\t"$4"\\t"$5"\\t"$6 }' RGINFO) -K 100000000 -p -v 3 -t 16 -Y /Homo_sapiens_assembly38.fasta - |
/opt/samblaster/samblaster -i /dev/stdin -o /dev/stdout |
/opt/sambamba_v0.6.0 view -t 16 -f bam -l 0 -S /dev/stdin |
/opt/sambamba_v0.6.0 sort -t 16 --natural-sort -m 5GiB --tmpdir ./ -o 2895820049.aligned.unsorted.bam -l 5 /dev/stdin
We returned Picard SamToFastq since samtools bam2fq did not do a great job in conversion - a significant number of the reads were converted as single-end for some reason which might affect the final outcome. Extract RG info from header part is some linux maneuvering to get the RG info from input BAM header (using BWA -R parameter), since in the original setup the RG information was added to the output BAM with Picard MergeBAMAlignment.
For one random pilot sample tested, average single BWA job duration is reduced from ~35min to ~28min and duplicate marking is done on-the-fly, saving nearly 5 hours needed for executing Picard Markduplicates.
@adamstruck Merge step is not included in the pipe, since alignment step is scattered by read group, so merge comes in later as a separate step in the current flow.
Regarding biobambam2 suite, bamsormadup might also be an option instead of the samblaster+sambamba combo in the current proposed setup, definitely could be worth exploring.
As for biobambam2 bam2fastq, I know it was also used in the GDC TCGA harmonization pipeline. I am not sure if this tool has an option to output interleaved fastq files, so it couldn't be just swapped into the pipe and it would probably require some reworking of the workflow configuration. Also, the output should also be checked and compared to the Picard execution. Other than that, it also might be worth exploring as an option.
Inputs with a single read group are aligned twice as long because of using 18 threads. An expression tool should be added to autodetect number of read groups in input BAM and set threads: If there is 1 RG, set 36; Else set 18
Done here https://github.com/bogdang989/kf-alignment-workflow/commit/b4a6228370a750324b200b6054ea60eddc36a9e3
@adamstruck I tested the biobambam bamtofastq and it indeed has the option to output interleaved fastq. What's even better it has the option to restore original quality scores before recalibration, thus fully allowing the use of samtools split instead of picard revertsam. So the proposed setup now would be:
Extract RG info from header && biobambam2 bamtofastq | BWA | Samblaster | Sambamba View | Sambamba Sort
Added this change here: https://github.com/bogdang989/kf-alignment-workflow/commit/0e9d63c78f8dd1eccb83117d93d12d730d89a69b
Still didn't test bamsormadup.
@adamstruck do you know whether bamsormadup
is library aware when handling the duplicate reads marking?
BWA Mem (scattered by read group)
Current setup:
Picard SamToFastq | BWA | Picard MergeBamAlignment
Proposed setup:
Samtools bam2fq | BWA | Samblaster | (mark duplicates here and remove picard_markduplicates; It makes sense to mark duplicates only within the same read group and saves a lot of time by piping it with BWA) | Sambamba view (sam to bam) | Sambamba Sort