malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License
14 stars 13 forks source link

Mosquito short read alignment pipeline - validation #22

Closed alimanfoo closed 2 years ago

alimanfoo commented 4 years ago

Issue for discussion of work to compare outputs of the short read alignment pipeline with expected outputs from previous pipeline implementation.

alimanfoo commented 4 years ago

Surfacing info from @gbggrant:


$ samtools idxstats test.AN0131-C.bam | sort
*   0   0   113112
2L  49364325    8046719 37697
2R  61545105    9509957 43065
3L  41963435    6640126 37168
3R  53200684    8478808 46014
Mt  15363   306422  123
UNKN    42389979    7699610 26255
X   24393108    2701566 32356
Y_unplaced  237045  2605961 1746
$ samtools idxstats truth.AN0131C.bam | sort
*   0   0   4665880
2L  49364325    6869952 87343
2R  61545105    8149276 101376
3L  41963435    5590331 80789
3R  53200684    7027770 93925
UNKN    42389979    6796285 422736
X   24393108    2114527 69684
Y_unplaced  237045  2504227 70917
$ samtools flagstat test.AN0131-C.bam 
46326705 + 0 in total (QC-passed reads + QC-failed reads)
1681687 + 0 secondary
0 + 0 supplementary
2442902 + 0 duplicates
45989169 + 0 mapped (99.27% : N/A)
44645018 + 0 paired in sequencing
22322509 + 0 read1
22322509 + 0 read2
41030516 + 0 properly paired (91.90% : N/A)
44083058 + 0 with itself and mate mapped
224424 + 0 singletons (0.50% : N/A)
1959244 + 0 with mate mapped to a different chr
890066 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools flagstat truth.AN0131C.bam 
44645018 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
3274737 + 0 duplicates
39052368 + 0 mapped (87.47% : N/A)
44645018 + 0 paired in sequencing
22322509 + 0 read1
22322509 + 0 read2
37112858 + 0 properly paired (83.13% : N/A)
38125598 + 0 with itself and mate mapped
926770 + 0 singletons (2.08% : N/A)
515812 + 0 with mate mapped to a different chr
247322 + 0 with mate mapped to a different chr (mapQ>=5)

Some quick stats ^^

‘test’ is the output of the new pipeline. ‘truth’ is the expected output.

Interesting? that the new pipeline aligned to Mt where the old did not.

alimanfoo commented 4 years ago

Interesting? that the new pipeline aligned to Mt where the old did not.

That's expected, we previously aligned to the AgamP3 reference which is exactly the same as AgamP4 expect it doesn't have the Mt.

alimanfoo commented 4 years ago

Hm, stats are totally different. Not sure where to start investigating this. Even the total number of reads is different.

alimanfoo commented 4 years ago

This could be down to differences in the input data, at least in part. Maybe I made a mistake somewhere in pointing to the lanelets to use from ENA. I'll double check that.

However, another very noticeable difference is the fraction of mapped reads, truth has 87% whereas test has 99%. I'm slightly surprised to get 99% of reads mapping in any sample.

alimanfoo commented 4 years ago

@gbggrant could you post idxstats and flagstat comparisons for the other test sample? Curious to see if that also mismatches in a similar way.

gbggrant commented 4 years ago

@alimanfoo Here are the idxstats and flagstat comparisons for AB0252_C:

samtools idxstats test.AB0252-C.bam | sort
*   0   0   108628
2L  49364325    8584887 51411
2R  61545105    10375873    47355
3L  41963435    7155320 41603
3R  53200684    9046717 49025
Mt  15363   429562  165
UNKN    42389979    6957552 24774
X   24393108    4673483 47190
Y_unplaced  237045  39865   585
samtools idxstats truth.AB0252_C.bam | sort
*   0   0   6111882
2L  49364325    6469850 99498
2R  61545105    8852164 115212
3L  41963435    5972921 89995
3R  53200684    7474801 106419
UNKN    42389979    5962884 445034
X   24393108    3837120 89736
Y_unplaced  237045  15221   4887
samtools flagstat test.AB0252-C.bam
47633995 + 0 in total (QC-passed reads + QC-failed reads)
1986371 + 0 secondary
0 + 0 supplementary
322316 + 0 duplicates
47263259 + 0 mapped (99.22% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
41953716 + 0 properly paired (91.91% : N/A)
45014780 + 0 with itself and mate mapped
262108 + 0 singletons (0.57% : N/A)
2051784 + 0 with mate mapped to a different chr
932352 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat truth.AB0252_C.bam
45647624 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
864369 + 0 duplicates
38584961 + 0 mapped (84.53% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
36864882 + 0 properly paired (80.76% : N/A)
37634180 + 0 with itself and mate mapped
950781 + 0 singletons (2.08% : N/A)
422608 + 0 with mate mapped to a different chr
215717 + 0 with mate mapped to a different chr (mapQ>=5)
alimanfoo commented 4 years ago

Thanks @gbggrant, I now realise that I have made a snafu here, my apologies. The "truth" BAMs I pointed you to in fact come from a previous version of the pipeline which used bwa aln/sampe and not the more recent version which used bwa mem and should match your implementation. That would explain the big different in the percentage of reads mapped, I believe bwa mem is a much more sensitive aligner.

The correct truth BAM files for these two samples (AB0252-C, AN0131-C.bam) are on lustre at Sanger, I'll put them somewhere publicly accessible and update the fixture here on github.

alimanfoo commented 4 years ago

Here's some stats for the correct truth BAMs (located here at Sanger: /lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/output/vo_agam_indelrealign/) ...

$ samtools flagstat AB0252-C.bam
47633790 + 0 in total (QC-passed reads + QC-failed reads)
639447 + 0 duplicates
47263055 + 0 mapped (99.22%:-nan%)
47633790 + 0 paired in sequencing
23794474 + 0 read1
23839316 + 0 read2
42918707 + 0 properly paired (90.10%:-nan%)
46947793 + 0 with itself and mate mapped
315262 + 0 singletons (0.66%:-nan%)
3347291 + 0 with mate mapped to a different chr
1138823 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools idxstats AB0252-C.bam
2R  61545105    10375637    47276
3R  53200684    9047045 48995
2L  49364325    8585966 51444
UNKN    42389979    6955138 24858
3L  41963435    7155615 41609
X   24393108    4674250 47189
Y_unplaced  237045  39829   567
Mt  15363   429575  169
*   0   0   108628
$ samtools flagstat AN0131-C.bam
46326540 + 0 in total (QC-passed reads + QC-failed reads)
3261375 + 0 duplicates
45989009 + 0 mapped (99.27%:-nan%)
46326540 + 0 paired in sequencing
23103905 + 0 read1
23222635 + 0 read2
41802625 + 0 properly paired (90.23%:-nan%)
45721722 + 0 with itself and mate mapped
267287 + 0 singletons (0.58%:-nan%)
3038022 + 0 with mate mapped to a different chr
1059942 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools idxstats AN0131-C.bam
2R  61545105    9509422 43199
3R  53200684    8478536 45918
2L  49364325    8047104 37656
UNKN    42389979    7699440 26403
3L  41963435    6641715 37019
X   24393108    2700347 32358
Y_unplaced  237045  2606018 1742
Mt  15363   306427  124
*   0   0   113112
alimanfoo commented 4 years ago

Also submitted PR #23 to correct the fixture data in this repo.

alimanfoo commented 4 years ago

Statistics comparison now looking much better :smile:

gbggrant commented 4 years ago

Thanks. It's too late now, but I should have posted the @PG record from the previously referenced 'truth' bam, which shows the usage of bwa aln and other differing tools:


@PG ID:bam_calculate_bq2    PN:samtools PP:gatk_indel_realigner_gatk22  VN:0.1.18-r572  CL:samtools calmd -Erb $bam_file $reference_fasta > $bq_bam_file
@PG ID:bam_fix_bam  PN:picard   PP:sam_to_fixed_bam VN:1.96 CL:java $jvm_args -jar CleanSam.jar INPUT=$bam_file OUTPUT=$fixed_bam_file VALIDATION_STRINGENCY=SILENT
@PG ID:bam_fix_mates    PN:picard   PP:bam_fix_bam  VN:1.96 CL:java $jvm_args -jar FixMateInformation.jar INPUT=$bam_file OUTPUT=$fixmate_bam_file SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=0
@PG ID:bwa_aln_fastq    PN:bwa  PP:bwa_index    VN:0.6.2-r126   CL:bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file
@PG ID:bwa_index    PN:bwa  VN:0.6.2-r126   CL:bwa index -a bwtsw $reference_fasta
@PG ID:bwa_sam  PN:bwa  PP:bwa_aln_fastq    VN:0.6.2-r126   CL:bwa sampe -a 1500 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)
@PG ID:gatk_indel_realigner_gatk22  PN:GenomeAnalysisTK PP:gatk_realigner_target_creator_gatk22 VN:2.4-9-g532efad   CL:java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -LOD 0.4 -model USE_SW -compress 0
@PG ID:gatk_left_align_indels_gatk2 PN:GenomeAnalysisTK PP:bam_fix_mates    VN:2.4-9-g532efad   CL:java $jvm_args -jar GenomeAnalysisTK.jar -T LeftAlignIndels -R $ref -I $bam_files -o $indels_left_aligned_bam_file 
@PG ID:gatk_realigner_target_creator_gatk22 PN:GenomeAnalysisTK PP:gatk_left_align_indels_gatk2 VN:2.4-9-g532efad   CL:java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -I $input_bam -o $intervals_file 
@PG ID:sam_to_fixed_bam PN:samtools PP:bwa_sam  VN:0.1.18-r572  CL:samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file
@PG ID:GATK IndelRealigner  VN:2.4-9-g532efad   CL:knownAlleles=[] targetIntervals=/lustre/scratch110/malaria/ag1000g/output/c/3/4/b/92899/1_gatk_realigner_target_creator_gatk2/pe.1.bam.intervals LODThresholdForCleaning=0.4 consensusDeterminationModel=USE_SW entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null```
gbggrant commented 4 years ago

Belatedly pointing to the outputs of runs of AN0131-C and AB0252-c through the Short Read Alignment Pipeline: AN0131-C has cromwell id: 99d5d6b4-4307-4eca-9ef5-119f55051692, it's workflow execution directory is: /lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692

AB0252-C has cromwell id: fa0f695b-f043-4d20-95d1-1ae6aa684882, it's workflow execution directory is: /lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882

Outputs from the pipeline for AN0131-C: "outputs": { "ShortReadAlignment.indel_realigner_interval_list_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-RealignerTargetCreator/cacheCopy/execution/AN0131-C.intervals", "ShortReadAlignment.callable_loci_summary_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-GatkCallableLoci/cacheCopy/execution/AN0131-C.gatk_callable_loci.summary.txt", "ShortReadAlignment.samtools_stats_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-SamtoolsStats/execution/AN0131-C.samtools_stats_report.txt", "ShortReadAlignment.validate_samfile_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ValidateSamFile/cacheCopy/execution/AN0131-C.validation_report.txt", "ShortReadAlignment.output_sams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-0/cacheCopy/execution/AN0131-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-1/cacheCopy/execution/AN0131-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignment/shard-2/cacheCopy/execution/AN0131-C.sam" ], "ShortReadAlignment.output_bam": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-IndelRealigner/cacheCopy/execution/AN0131-C.bam", "ShortReadAlignment.output_bams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-0/cacheCopy/execution/AN0131-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-1/cacheCopy/execution/AN0131-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/99d5d6b4-4307-4eca-9ef5-119f55051692/call-ReadAlignmentPostProcessing/shard-2/cacheCopy/execution/AN0131-C.bam" ] },

Outputs from the pipeline for AB0252-C: "outputs": { "ShortReadAlignment.indel_realigner_interval_list_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-RealignerTargetCreator/execution/AB0252-C.intervals", "ShortReadAlignment.callable_loci_summary_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-GatkCallableLoci/execution/AB0252-C.gatk_callable_loci.summary.txt", "ShortReadAlignment.samtools_stats_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-SamtoolsStats/execution/AB0252-C.samtools_stats_report.txt", "ShortReadAlignment.validate_samfile_report_file": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ValidateSamFile/execution/AB0252-C.validation_report.txt", "ShortReadAlignment.output_sams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-0/execution/AB0252-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-1/execution/AB0252-C.sam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignment/shard-2/execution/AB0252-C.sam" ], "ShortReadAlignment.output_bam": "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-IndelRealigner/execution/AB0252-C.bam", "ShortReadAlignment.output_bams": [ "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-0/execution/AB0252-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-1/execution/AB0252-C.bam", "/lustre/scratch118/malaria/team112/pipelines/cromwell_production/ShortReadAlignment/fa0f695b-f043-4d20-95d1-1ae6aa684882/call-ReadAlignmentPostProcessing/shard-2/execution/AB0252-C.bam" ] }, Let me know if you'd like me to copy them elsewhere.

alimanfoo commented 2 years ago

Closing as complete.