samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

samtools: mpileup -r <regionspec> includes wrong areas #306

Open jblachly opened 8 years ago

jblachly commented 8 years ago

samtools mpileup -r chrA:nnn-mmm includes region chrA:nnn-mmm correctly, but for some combinations of BAM files it also returns pileups from the same coordinates on other contigs, for example chrF:nnn-mmm and chrQ:nnn-mmm

It is not yet clear to me under what circumstances this behaviour is triggered. I have only experienced this when passing a list of bamfiles with -b, and notably when passing the following -b <(head bamfiles.txt | tail -n [1..x]) to narrow down to a specific file that could trigger it, I triggered the behaviour at tail -n 4. However, passing the 4th bamfile ALONE does not trigger the error.

I previously entered this as issue #305 but it was closed in error.

jmarshall commented 8 years ago

[Then as requested on #305, you should have re-opened that issue.]

Please provide enough information so that we can reproduce the problem. If you can't provide that level of information, I'm afraid this will likely just be closed as works-for-us. See also these bug reporting suggestions.

jkbonfield commented 8 years ago

Sorry for closing the ticket, I misunderstood your previous comment. If it only triggers with -r and not -l then it implies an indexing error. I've tried creating synthetic data and can't get this behaviour though. Can you repeat this using any publically available data sets?

Also when you say it returns pileups for the same coordinates on other contigs, do you mean that mpileup is reporting other contig names or that it claims to be reporting the correct contig but the data columns being shown actually come from elsewhere?

Finally, just in case, have you tried reindexing your bams just on the off chance there was a mixup somewhere? (Best take backups first, so you can compare if needed.)

jkbonfield commented 8 years ago

The other thought is, are all the headers identical in all input files? I wouldn't be suprised if having different @SQ headers (eg the order of entries or a differing number of entries) could trigger some interesting merge issues.

jblachly commented 8 years ago

Yes; I will try to list some steps that reproduce this error using publicly available data. Presently we are working with TCGA data which is controlled-access. Sorry for that.

Also, as I am not an htslib/samtools contributor, I do not think I can re-open issues closed by others. Please let me know if this is not the case as I am just transitioning to github from another VCS vendor.

jmarshall commented 8 years ago

Almost all mpileup code executed is the same whether the BAM files are listed on the command line or in a -b flle list, so any problem is unlikely to be restricted to -b.

[I thought non-contributors could re-open their own issues in GitHub's model, but it seems not. If this happens again, probably the best thing to do is just to comment on the recently-closed issue, at which point the maintainers will re-open it.]

jblachly commented 8 years ago

I think your intuition that there could be header problems may be right.

I note that among the 872 BAM files in my set (all RNA-seq samples from the TCGA endometrial cancer set, if anyone on dev team has TCGA access for tool development purposes -- I think lh3 does), some BAM file headers have contigs sorted numerically (chr1,chr2,chr3...) and others have them sorted lexicographically (chr1,chr10,chr11,...chr2,chr20,...).

This seems the most likely cause of the merge issues.

What worries me right now is that I may be getting pileup output from the wrong contigs...

Steps to reproduce: Take any two BAM files (note that despite BAM file header having SO:unsorted, the alignments within contigs are in fact sorted by position; obviously all BAM files are indexed as well as I am using -r) with headers as follows:

@HD     VN:1.0  SO:unsorted
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chrM_rCRS    LN:16569
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@RG     ID:120307_UNC2-RDR300275_00118_FC_650V8AAXX_8_  PL:illumina     PU:barcode LB:TruSeq       SM:120307_UNC2-RDR300275_00118_FC_650V8AAXX_8_

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@SQ     SN:chrM LN:16571
@RG     ID:110921_UNC6-RDR300211_00125_FC_634F4AAXX.1.trimmed   SM:110921_UNC6-RDR300211_00125_FC_634F4AAXX.1.trimmed

And pileup with -r chr16:56839363-56839383 will generate pileups at both the indicated location as well as chr3:56839363-56839383.

Despite this, this error is not triggered every time (I have a list of 20 or so different locations; not all of them will trigger the error. Only about half do, and I am not sure why).

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3-1-g493b1ca (using htslib 1.3-1-gc72ae90)

If there is additional specific data I can provide please let me know.

Kind regards

jblachly commented 8 years ago

I can now confirm that mpileup is returning data from correct contig, but LABELING it with the wrong label in the pileup output.

In the following example, the second file in bamlist23.txt (../bamfiles/5377e432-bac5-432c-bae1-941697dc435b/UNCID_568019.35759844-0020-4b37-bd5a-ede870b7f0ab.110921_UNC6-RDR300211_00125_FC_634F4AAXX.1.trimmed.annotated.translated_to_genomic.bam) has no reads (even spliced) covering the region on chromsome 3. It does have spliced reads over the region on chr16, yet it is reported in the pileup output at chr3, rather than chr16.

So, the little bit of good news is that the code doing the lookups works properly, but there must be an array lookup problem with respect to the contig name.

reproduce-error$ ~/source/github/samtools/samtools mpileup -f ../GCA_000001405.14_GRCh37.p13_no_alt_analysis_set.fna -r chr16:56839363-56839383 -b bamlist23.txt 
Warning: The index file is older than the data file: ../bamfiles/4dab4ccc-cab7-4e36-82ad-f613da33f64d/UNCID_1606302.e527020e-6b3f-4837-87d1-8f7ba68c19d6.sorted_genome_alignments.bam.bai
Warning: The index file is older than the data file: ../bamfiles/5377e432-bac5-432c-bae1-941697dc435b/UNCID_568019.35759844-0020-4b37-bd5a-ede870b7f0ab.110921_UNC6-RDR300211_00125_FC_634F4AAXX.1.trimmed.annotated.translated_to_genomic.bam.bai
[mpileup] 2 samples in 2 input files
<mpileup> Set max per-file depth to 4000
chr16   56839363        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839364        G       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839365        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839366        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839367        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839368        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839369        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839370        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839371        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839372        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839373        G       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839374        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839375        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839376        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839377        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839378        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839379        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839380        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839381        C       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839382        C       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr16   56839383        T       19      ><>>><><<>>><<<>>><     CGAHDFCIGGHBG=9:IF8     0       *       *
chr3    56839363        A       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839364        G       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839365        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839366        A       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839367        A       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839368        T       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839369        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839370        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839371        T       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839372        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839373        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839374        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839375        G       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839376        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839377        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839378        T       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839379        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839380        A       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839381        G       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839382        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
chr3    56839383        C       0       *       *       12      <<<>><<<<><<    @D:FEDC>DEEA
jkbonfield commented 8 years ago

On Wed, Dec 16, 2015 at 10:41:39AM -0800, jblachly wrote:

reproduce-error$ ~/source/github/samtools/samtools mpileup -f ../GCA_000001405.14_GRCh37.p13_no_alt_analysis_set.fna -r chr16:56839363-56839383 -b bamlist23.txt 
Warning: The index file is older than the data file: ../bamfiles/4dab4ccc-cab7-4e36-82ad-f613da33f64d/UNCID_1606302.e527020e-6b3f-4837-87d1-8f7ba68c19d6.sorted_genome_alignments.bam.bai
Warning: The index file is older than the data file: ../bamfiles/5377e432-bac5-432c-bae1-941697dc435b/UNCID_568019.35759844-0020-4b37-bd5a-ede870b7f0ab.110921_UNC6-RDR300211_00125_FC_634F4AAXX.1.trimmed.annotated.translated_to_genomic.bam.bai

Are these warnings correct? If so is this the actual problem, that it simply needs reindexing to get the correct mapping?

(It's seems unlikely though as out of date indices are more likely to cause more fatal decoding errors.)

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jblachly commented 8 years ago

James - the "older than" warnings are just a byproduct of the way the files are downloaded from TCGA. The genetorrent software does not preserve timestamps; the index is often transferred mere seconds before the BAM file. Both are timestamped with creation time on the local system.

James

kirkmcclure commented 8 years ago

The differing order of @SQ header lines, as shown above, causes mpileup problems. A toy data set illustrates the effect - and please note that more than the label is affected -

> ./samtools mpileup -r chr3:2-4 -f /tmp/issue306.fa /tmp/issue306Example1.bam /tmp/issue306Example2.bam
[mpileup] 2 samples in 2 input files
<mpileup> Set max per-file depth to 4000
chr16   2   T   0   *   *   11  AAAAAAAAA-2AAA-2AAA-2AA ~~~~~~~~~~~
chr16   3   A   0   *   *   11  GGGGGG-1AG-1AG-1A***    ~~~~~~~~~~~
chr16   4   A   0   *   *   11  TT+3ACGT+1AT+1CT+1G*+1G*+1G*+1G*+1A*+1A*+1A ~~~~~~~~~~~

> ./samtools mpileup -r chr3:2-4 -f /tmp/issue306.fa /tmp/issue306Example1b.bam /tmp/issue306Example2.bam
[mpileup] 2 samples in 2 input files
<mpileup> Set max per-file depth to 4000
chr3    2   A   0   *   *   11  .........-2GT.-2GT.-2GT ~~~~~~~~~~~
chr3    3   G   0   *   *   11  ......-1T.-1T.-1T***    ~~~~~~~~~~~
chr3    4   T   0   *   *   11  ..+3ACG.+1A.+1C.+1G*+1G*+1G*+1G*+1A*+1A*+1A ~~~~~~~~~~~

> diff /tmp/issue306Example1.sam /tmp/issue306Example1b.sam
2d1
< @SQ   SN:chr16    LN:9
3a3
> @SQ   SN:chr16    LN:9

> grep SQ /tmp/issue306Example1.sam /tmp/issue306Example1b.sam
/tmp/issue306Example1.sam:@SQ   SN:chr16    LN:9
/tmp/issue306Example1.sam:@SQ   SN:chr3 LN:10
/tmp/issue306Example1.sam:@SQ   SN:chr30    LN:12
/tmp/issue306Example1b.sam:@SQ  SN:chr3 LN:10
/tmp/issue306Example1b.sam:@SQ  SN:chr16    LN:9
/tmp/issue306Example1b.sam:@SQ  SN:chr30    LN:12
jmarshall commented 8 years ago

Thanks for the confirmation, @kirkmcclure. Problem exists since 0.1.18 at least.

jmarshall commented 8 years ago

Could be ameliorated by a header API (#293).