Closed moonso closed 9 years ago
Thanks for this PR.
Regarding setting the default start position, this was originally discussed issue #27. My thought then was that we only need to look at the first read in a BAM file, since the reads need to be sorted anyway and so if there are any unaligned reads they should be at the end of the file. With the current change, one could end up examining every read in a potentially very large BAM file to set the default location, which could take a very long time. Thoughts?
Ok, sorry I haden't seen that discussion. In one of my test cases this solution solved the problem with spotting the start position.
On might put a limit on the number of reads to go through and if a mapped read is not found among those the program exits?
For your test case, was the bam file sorted? Was it perhaps sorted by name rather than coordinate?
We could limit the number of reads to go through as you suggest, but first I'd like to understand why your example didn't work with the current method.
Ok I've checked this now. The first two reads in my file have 0 in mapping quality and is flagged as unmapped. I guess that the reason for them to not be in the end of the file is that their mates are mapped, but I'm not 100%. Anyway there might be different explanations to this that is also dependent on what software that performed the mapping. So maybe it's a good thing to look further into the bam than only the first read?
Was your BAM file indexed and sorted by coordinate? Could you possibly paste the SAM file with the first ~5 reads here? Anyway, if this is the case, I agree it could make sense to check the first n reads where n is something small.
HISEQ:63:H82C9ADXX:1:2215:2626:72342 101 20 60076 0 * = 60076 0 TCCTACTGAGAGAAACCAAAATATTGAGTAAACTATCCCACTTTGAACAGAATTTTTAAGAGAAAAAACTGAAAGTTAATAGAGAGGTGACTCAGATCCAG BBBFFFFFFFFFFIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFF<BBFFFFFFFFFFF ZA:Z:<@;0;;;0;;><&;27;;;1;101M;101> RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 HWI-D00450:67:H8AHNADXX:1:1111:11290:19503 101 20 60076 0 * = 60076 0 TCCTACTGAGAGAAACCAAAATATTGAGTAAACTATCCCACTTTGAACAGAATTTTTAAGAGAAAAAACTGAAAGTTAATAGAGAGGTGACTCAGATCCAG BBBFFFFFFFFFFIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFF ZA:Z:<@;0;;;0;;><&;27;;;1;101M;101> RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane1 HISEQ:63:H82C9ADXX:1:2215:2626:72342 153 20 60076 27 101M = 60076 0 AACAGGTGGTAAGGAAGGAGAGAGTGAAGGAACTGCCAGGTGACACACTCCCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATG FFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFBFFFFIIIIIIFFFBBIIIIFBIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;0;;;0;;><@;27;;;1;; MD:Z:101 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:0 HWI-D00450:67:H8AHNADXX:1:1111:11290:19503 1177 20 60076 27 101M = 60076 0 AACAGGTGGTAAGGAAGGAGAGAGTGAAGGAACTGCCAGGTGACACACTCCCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIIIIIIIIFFBIIIIFFIIIIIFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;0;;;0;;><@;27;;;1;; MD:Z:101 RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane1 NM:i:0 HISEQ:63:H82C9ADXX:1:1210:6887:65615 99 20 60760 33 101M = 61000 340 GTAACCTGTTTGTCAGCCACAACATCTTCCTAAGGGAGCCTTGTGTCCGGGAAAAACTGACAGACCAGTGATCTGGGTGCAGAAGGCTTGAGACAAAACTA BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:<@;33;;;1;;><&;32;;;1;101M;98C2> MD:Z:101 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:0 HISEQ:63:H82C9ADXX:1:1210:6887:65615 147 20 61000 32 101M = 60760 -340 TGCCATGCCCAGGATCCTCTGCCCTTGATCCTGAATCAACAGACCACTTGCAGATATACTTCACAGCCCACGCTGACTCTGCCAAGCACAGACAACCATTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIIIIIIIIFFFFFFIIIIIIIIIIFFIIIIFIIIIIFFIIIIIFIIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;33;;;1;101M;101><@;32;;;1;; MD:Z:98C2 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:1 HWI-D00450:67:H8AHNADXX:2:2112:6252:95594 73 20 61055 27 101M = 61055 0 ATACTTCACAGCCCACGCTGACTCTGCCAAGCACAGACAACCATTGGGCCCCAGGGGAGCTGCAGGTCTCCTGGTCACCTAATCTTTTTTTTTTTTATACT BBBFFFFFFFFFFIIIIFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:<@;27;;;1;;><&;0;;;0;;> MD:Z:43C57 RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane2 NM:i:1 HWI-D00450:67:H8AHNADXX:2:2112:6252:95594 133 20 61055 0 * = 61055 0 GTTTGATAGTTTGCTGAGAATGATGGTTTCCAGCTTCATCCATGTCCCTGCAAAGGGCATGAACTCATCATTTTTTATGGCTGCATAGTATTCCATGGCAC BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:&;27;;;1;101M;43C57><@;0;;;0;; RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane2 HISEQ:63:H82C9ADXX:1:1215:15890:18485 101 20 61443 0 * = 61443 0 AACAGGCCCTGGTGCCATGGAATACTATGCAGCCATAAAAAATGATGAGTTCATGCCCTTTGCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCA BBBFFFFFFFFFFFIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIFFIIFFIIIIIIIIIIIIIFFFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFF ZA:Z:<@;0;;;0;;><&;27;;;1;101M;101> RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 HISEQ:63:H82C9ADXX:1:1215:15890:18485 153 20 61443 27 101M = 61443 0 TCACTCATAGGTGGGAATTGAATAATGAGAACACATGGACACAGGAAGGGGAACATCACACATCGGGACCTAATCTTAAGCTAAGTGTGGCTAAGAGCCTA FBBBFFFFFFFFFFFFFFFFFFFFFFFFIIFFFIIIIFFFFIIIIFIIIFIIIIIFFBFFIFFIIIFFIFFIIFIFFIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;0;;;0;;><@;27;;;1;; MD:Z:101 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:0
I see. Even though the first two reads are unmapped they have a coordinate listed and get put at the top of the sorted bam file, which as you say probably differs between mapping software. According to the SAM specification an unmapped read "may also have an ordinary coordinate such that it can be placed at a desired position after sorting" which I was not aware of.
Anyway, so yes I think it makes sense to only look at the first "n" reads, where I was thinking n=~100. Sound good?
Nice to understand the problem! Yes I guess 100 reads should be enough.
ok, would you like to add it :) and then I'll merge the PR? Should just need to add a simple counter.
Thanks for pointing this issue out!
Sure, I'll fix it tomorrow
Fixed this in the latest commit:)
looks good, just merged. thanks!