samtools / htslib

C library for high-throughput sequencing data formats
Other
812 stars 446 forks source link

differences in the reads returned samtools view with version #405

Closed dsthughes closed 8 years ago

dsthughes commented 8 years ago

hi, i was using the api to iterate through a range to generate some read statistics and got different results from those from monging the output of a samtools view command direct. eventually i realised it was due to my use of a newer version of the api that the one my path was picking up. when i ran the samtools view command with the same version i got the same results as with the api - i assumed that there were differences in default filtering but couldn't find differences in the bitflags etc., and now it seems that the newer version is just truncated e.g.


~/tools/samtools/bin/samtools 2>&1 | head -3
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1-36-g613501f (using htslib 1.3.1-55-g5869a67)

/ifs/work/leukgen/opt/cgp/5.18.4/pcap/1.14.0/bin/samtools 2>&1 | head -3
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.20

/ifs/work/leukgen/opt/cgp/5.18.4/pcap/1.14.0/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc
   1404   22467  450532

~/tools/samtools/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc                                 
    801   12817  256961

the output is simply truncated - here's the relevant section with a diff and lines numbered


788 HWI-D00655:23:C6JV6ANXX:5:1310:12511:4782       HWI-D00655:23:C6JV6ANXX:5:1310:12511:4782
    789 HWI-D00655:23:C6JV6ANXX:5:1315:15429:93136      HWI-D00655:23:C6JV6ANXX:5:1315:15429:93136
    790 HWI-D00655:23:C6JV6ANXX:5:1311:19645:92563      HWI-D00655:23:C6JV6ANXX:5:1311:19645:92563
    791 HWI-D00655:23:C6JV6ANXX:5:1316:5282:15426       HWI-D00655:23:C6JV6ANXX:5:1316:5282:15426
    792 HWI-D00655:23:C6JV6ANXX:5:2105:14502:13151      HWI-D00655:23:C6JV6ANXX:5:2105:14502:13151
    793 HWI-D00655:23:C6JV6ANXX:5:1213:6955:62579       HWI-D00655:23:C6JV6ANXX:5:1213:6955:62579
    794 HWI-D00655:23:C6JV6ANXX:5:1110:18009:22359      HWI-D00655:23:C6JV6ANXX:5:1110:18009:22359
    795 HWI-D00655:23:C6JV6ANXX:5:1111:6132:18171       HWI-D00655:23:C6JV6ANXX:5:1111:6132:18171
    796 HWI-D00655:23:C6JV6ANXX:5:2107:9215:8844        HWI-D00655:23:C6JV6ANXX:5:2107:9215:8844
    797 HWI-D00655:23:C6JV6ANXX:5:2111:20913:90160      HWI-D00655:23:C6JV6ANXX:5:2111:20913:90160
    798 HWI-D00655:23:C6JV6ANXX:5:2206:20799:29759      HWI-D00655:23:C6JV6ANXX:5:2206:20799:29759
    799 HWI-D00655:23:C6JV6ANXX:5:2215:11608:47432      HWI-D00655:23:C6JV6ANXX:5:2215:11608:47432
    800 HWI-D00655:23:C6JV6ANXX:5:1208:10149:2083       HWI-D00655:23:C6JV6ANXX:5:1208:10149:2083
    801 HWI-D00655:23:C6JV6ANXX:5:1308:16423:72444      HWI-D00655:23:C6JV6ANXX:5:1308:16423:72444
    802                                               > HWI-D00655:23:C6JV6ANXX:5:2301:8809:5629
    803                                               > HWI-D00655:23:C6JV6ANXX:5:2303:2256:2219
    804                                               > HWI-D00655:23:C6JV6ANXX:5:1104:12549:52658
    805                                               > HWI-D00655:23:C6JV6ANXX:5:1104:4263:89071
    806                                               > HWI-D00655:23:C6JV6ANXX:5:2107:1700:28047
    807                                               > HWI-D00655:23:C6JV6ANXX:5:1115:14874:85213
    808                                               > HWI-D00655:23:C6JV6ANXX:5:1209:5472:62473
    809                                               > HWI-D00655:23:C6JV6ANXX:5:1201:20488:50387
    810                                               > HWI-D00655:23:C6JV6ANXX:5:1211:8004:14034
    811                                               > HWI-D00655:23:C6JV6ANXX:5:1203:14368:53397
    812                                               > HWI-D00655:23:C6JV6ANXX:5:1204:12915:5160
    813                                               > HWI-D00655:23:C6JV6ANXX:5:1309:12604:83130
    814                                               > HWI-D00655:23:C6JV6ANXX:5:1301:19047:8361
    815                                               > HWI-D00655:23:C6JV6ANXX:5:2303:4718:63444
    816                                               > HWI-D00655:23:C6JV6ANXX:5:2304:8408:20032
    817                                               > HWI-D00655:23:C6JV6ANXX:5:2305:3264:21148

are you familiar with this behavior?

Thanks!

dan

dsthughes commented 8 years ago

FWIW: i get the correct result with the last release:

~/tools/samtools/samtools-1.3.1/bin/samtools 2>&1 | head                                                                                                        

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

~/tools/samtools/samtools-1.3.1/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc 
   1404   22467  450532
jkbonfield commented 8 years ago

Thanks for the bug report.

If you output to a file, eg > /dev/null and then check $? exit status, was it non-zero? This is curious behaviour, but truncated data sounds like it found an error and tripped up. I wonder what on.

Is the data publically available anywhere?

jmarshall commented 8 years ago

Please try building current samtools with htslib c6cb4c392ad4d1ae937fca35b524d17905a7fec9. If this produces the count of 1404, 5869a67c3c19aa9e2b3b356269dadaf9aff1c669 will be confirmed as the culprit. It is a likely culprit as it affects the region ⇒ iterator computations.

jmarshall commented 8 years ago

To reiterate @jkbonfield's question: is this data available publicly anywhere? If we can reproduce this ourselves, we'll be able to track down what's going on quite quickly.

Alternatively, please post the reads you posted in the diff above, along with their FLAG, POS, and CIGAR fields — so e.g. samtools view … | cut -f1,2,4,6 — and we may be able to draw some conclusions from that.

dsthughes commented 8 years ago

I'm so sorry, this was being filtered in my junk mail. Will do this today.

Dan

On Thursday, August 25, 2016, John Marshall notifications@github.com wrote:

To reiterate @jkbonfield https://github.com/jkbonfield's question: is this data available publicly anywhere? If we can reproduce this ourselves, we'll be able to track down what's going on quite quickly.

Alternatively, please post the reads you posted in the diff above, along with their FLAG, POS, and CIGAR fields — so e.g. samtools view … | cut -f1,2,4,6 — and we may be able to draw some conclusions from that.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htslib/issues/405#issuecomment-242427048, or mute the thread https://github.com/notifications/unsubscribe-auth/ASaXAxT64pYTd_wr4gXSuSydNMtO3Kxiks5qjbM1gaJpZM4Jopoc .

Daniel S. T. Hughes M.Biochem (Hons; Oxford), Ph.D (Cambridge)

dsth@cantab.net dsth@cpan.org

dsthughes commented 8 years ago

And where the hell did that poncy Dr Daniel... Sender name come from - I hate apple!

On Thursday, August 25, 2016, Dr Daniel Hughes daniel.hughes@lmh.oxon.org wrote:

I'm so sorry, this was being filtered in my junk mail. Will do this today.

Dan

On Thursday, August 25, 2016, John Marshall <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

To reiterate @jkbonfield https://github.com/jkbonfield's question: is this data available publicly anywhere? If we can reproduce this ourselves, we'll be able to track down what's going on quite quickly.

Alternatively, please post the reads you posted in the diff above, along with their FLAG, POS, and CIGAR fields — so e.g. samtools view … | cut -f1,2,4,6 — and we may be able to draw some conclusions from that.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htslib/issues/405#issuecomment-242427048, or mute the thread https://github.com/notifications/unsubscribe-auth/ASaXAxT64pYTd_wr4gXSuSydNMtO3Kxiks5qjbM1gaJpZM4Jopoc .

Daniel S. T. Hughes M.Biochem (Hons; Oxford), Ph.D (Cambridge)


dsth@cantab.net javascript:_e(%7B%7D,'cvml','dsth@cantab.net'); dsth@cpan.org javascript:_e(%7B%7D,'cvml','dsth@cpan.org');

Daniel S. T. Hughes M.Biochem (Hons; Oxford), Ph.D (Cambridge)

dsth@cantab.net dsth@cpan.org

jmarshall commented 8 years ago

Great — I'm looking forward to getting to the bottom of this one :smile:

dsthughes commented 8 years ago

Hey, so I'd checked return values etc., when using the API direct but to confirm - two older versions:

/ifs/work/leukgen/opt/cgp/5.18.4/pcap/1.14.0/bin/samtools 2>&1 | perl -ne 'print if($.==3)' Version: 0.1.20 /ifs/work/leukgen/opt/cgp/5.18.4/pcap/1.14.0/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc
1404 22467 450532 /ifs/work/leukgen/opt/cgp/5.18.4/pcap/1.14.0/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 > /dev/null ; echo $? 0

~/tools/samtools/samtools-1.3.1/bin/samtools 2>&1 | perl -ne 'print if($.==3)' Version: 1.3.1 (using htslib 1.3.1) ~/tools/samtools/samtools-1.3.1/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc
1404 22467 450532 ~/tools/samtools/samtools-1.3.1/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 > /dev/null ; echo $? 0

and the one with an issue:

~/tools/samtools/git/bin/samtools 2>&1 | perl -ne 'print if($.==3)'
Version: 1.3.1-36-g613501f (using htslib 1.3.1-55-g5869a67) ~/tools/samtools/git/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc
801 12817 256961 ~/tools/samtools/git/bin/samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 > /dev/null ; echo $?
0

dsthughes commented 8 years ago

so alas the data is patient data from a relapse case - they'd cut me into small pieces and feed them to sewer rats for making it available but i'm sure i can send the index if you think that would be useful. i'm just doing make clean, make and running with various commits now.

dan

jmarshall commented 8 years ago

Fair enough re being patient data. What will be interesting is your results with htslib c6cb4c392ad4d1ae937fca35b524d17905a7fec9 and what will hopefully provide sufficient hints to reproduce this with clean data is the QNAME/FLAG/ POS/CIGAR (not the SEQ!!) for that sampling of reads, if that seems sufficiently non-confidential. (Via email just to me if you prefer.)

dsthughes commented 8 years ago

I'd prefer email but of course i'll send those cols. bare with me a mo' i've got 10 things i'm supposed to be doing at once

dsthughes commented 8 years ago

you are indeed correct :

first 5869a67 :

htslib

git show --oneline -s
5869a67 hts_itr_query(): discard chunks far beyond the query region

make clean...

samtools

git show --oneline -s
613501f Exterminate error(), fix memory leaks in event of write errors

make clean...

./samtools 2>&1 | perl -ne 'print if($.==3)'
Version: 1.3.1-36-g613501f (using htslib 1.3.1-55-g5869a67) ./samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc
801 12817 256961

now c6cb4c3

htslib

git show --oneline -s
c6cb4c3 Use finer-grained $(INSTALL_LIB) and $(INSTALL_MAN) macros make clean...

samtools ...

./samtools 2>&1 | perl -ne 'print if($.==3)'
Version: 1.3.1-36-g613501f (using htslib 1.3.1-54-gc6cb4c3) ./samtools view /ifs/res/leukgen/local/opt/leukdc/workflows/W0007220F/data/bam/I-H-103893-T1-1-D1-3.bam 1:9781176-9781176 | wc
1404 22467 450532

jmarshall commented 8 years ago

Since it's via email and we're not spamming the world, I reckon either interval or chromosome. Thanks!

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

mp15 commented 8 years ago

Sanger handles a lot of patient data and even EHRs. If the issue is unsolvable without data perhaps some sort of small scale materials transfer agreement might sooth their nerves?

dsthughes commented 8 years ago

oh silly, me you want the difference alone. please see your email.

jmarshall commented 8 years ago

@dsthughes has informed me that there's no clipping or anything else unusual about the disappearing reads themselves; it's simply all the reads at POS 9781150 and after that are missing. The region of interest is in bin 5277, which spans 9764865–9781248. So it's likely that the missing reads' end positions have crossed the 9781248 boundary taking them up to a larger bin.

This is something to check in the code, but in principle it ought to work. So let's focus on the index too: I guess this is a .bai index? Was it made with samtools index or something else such as Picard?

jmarshall commented 8 years ago

Thanks for building multiple versions and testing etc, @dsthughes. Could you build current samtools with the htslib fix/405 branch and rerun the samtools view query? This will dump out to stderr information about how it's using the index to choose the reads to consider; you can email me or post these half dozen or so lines of standard error here, as they show offsets within the BAM file but no data at all from the BAM file itself.

dsthughes commented 8 years ago

Will do this afternoon. I naively assumed it was during index handling which is why i offered the index. It's a samtools index. Would it be helpful?

On Friday, August 26, 2016, John Marshall notifications@github.com wrote:

Thanks for building multiple versions and testing etc, @dsthughes https://github.com/dsthughes. Could you build current samtools with the htslib fix/405 branch and rerun the samtools view query? This will dump out to stderr information about how it's using the index to choose the reads to consider; you can email me or post these half dozen or so lines of standard error here, as they show offsets within the BAM file but no data at all from the BAM file itself.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/htslib/issues/405#issuecomment-242722821, or mute the thread https://github.com/notifications/unsubscribe-auth/ASaXA3o4TVW3VR72EWIdPRHGjpDmZXomks5qjty8gaJpZM4Jopoc .

Daniel S. T. Hughes M.Biochem (Hons; Oxford), Ph.D (Cambridge)

dsth@cantab.net dsth@cpan.org

jmarshall commented 8 years ago

Sorry, forgot about that… Yes, if you can email me the .bai file that's even better than sending that dumped info.

dsthughes commented 8 years ago

Just want to check the email with the index didn't get filtered?

dan

jmarshall commented 8 years ago

Yep, received thanks.

jmarshall commented 8 years ago

In the region of interest, your index reflects reads interleaved in bin 5277 and its parent, bin 659, as expected if there is a mixture of reads with and without deletions that push their ends past position 9781248:

    4941754:14241   4941754:26016   Bin 5276  (9748481-9764864)
    4941754:26016   9067430:31508   Bin 5277  (9764865-9781248)
    9067430:31508   9067430:38943   Bin 659   (9699329-9830400)
    9067430:38943   9067430:39203   Bin 5277
    9067430:39203   9067430:39460   Bin 659
    9067430:39460   9067430:39720   Bin 5277
    9067430:39720   9067430:44335   Bin 659
    […etc interleaving 5277 and 659 for a while…]
    9225889:21851   9225889:22093   Bin 5277
    9225889:22093   9225889:31641   Bin 659
    9225889:31641   12548989:17833  Bin 5278  (9781249-9797632)

For your query region, the fix/405 logging shows

for min_off, look at bin #5277
min_off from loff in bin #5277
for max_off, look at bin #5278
hts_itr_query: clipping to min_off 4941754:26016, max_off 9067430:31508

and proceeds to omit chunks in bins 659 and 5277(!) due to max_off.

The SAM spec says of BAI files' linear indices:

In the linear index, for each tiling 16384bp window on the reference, we record the smallest file offset of the alignments that start in the window. [Emphasis added.]

Commit 5869a67c3c19aa9e2b3b356269dadaf9aff1c669 was written using the linear index as thus described.

Here we are clipping from file offset 9067430:31508, which has come from the linear index entry for bin 5278 / position 9781249. However that file offset is the offset of a read in bin 659, which must have a starting position of ~100 bases less than 9781248 (as it precedes other reads that are in bin 5277).

So I believe samtools's indexing code is in fact setting linear index entries to “the smallest file offset of the alignments that intersect with the window”, and that it has always done so. And a cursory glance suggests that the relevant Picard code has followed the C code's lead on this. (And it may be that this redefinition is necessary for the original main use for the linear index, calculating min_off. Note that the later CSI spec says “file offset of the first overlapping record” for a CSI field that is derived from the BAI linear index…)

So 5869a67c3c19aa9e2b3b356269dadaf9aff1c669 needs to be either reverted or altered to clip based on the next-to-the-right bin's binning index instead (hopefully). And there is a spec bug to fix or wording to clarify.

(Similar conclusions can also be drawn from @daviesrob's privately-communicated example.)

jmarshall commented 8 years ago

Okay, this should be fixed now on the develop branch. Sorry for the inconvenience!