mgymrek / pybamview

Browser based application for viewing bam alignments
MIT License
56 stars 17 forks source link

Develop #42

Closed moonso closed 9 years ago

moonso commented 9 years ago
mgymrek commented 9 years ago

Thanks for this PR.

Regarding setting the default start position, this was originally discussed issue #27. My thought then was that we only need to look at the first read in a BAM file, since the reads need to be sorted anyway and so if there are any unaligned reads they should be at the end of the file. With the current change, one could end up examining every read in a potentially very large BAM file to set the default location, which could take a very long time. Thoughts?

moonso commented 9 years ago

Ok, sorry I haden't seen that discussion. In one of my test cases this solution solved the problem with spotting the start position.

On might put a limit on the number of reads to go through and if a mapped read is not found among those the program exits?

mgymrek commented 9 years ago

For your test case, was the bam file sorted? Was it perhaps sorted by name rather than coordinate?

We could limit the number of reads to go through as you suggest, but first I'd like to understand why your example didn't work with the current method.

moonso commented 9 years ago

Ok I've checked this now. The first two reads in my file have 0 in mapping quality and is flagged as unmapped. I guess that the reason for them to not be in the end of the file is that their mates are mapped, but I'm not 100%. Anyway there might be different explanations to this that is also dependent on what software that performed the mapping. So maybe it's a good thing to look further into the bam than only the first read?

mgymrek commented 9 years ago

Was your BAM file indexed and sorted by coordinate? Could you possibly paste the SAM file with the first ~5 reads here? Anyway, if this is the case, I agree it could make sense to check the first n reads where n is something small.

moonso commented 9 years ago

HISEQ:63:H82C9ADXX:1:2215:2626:72342 101 20 60076 0 * = 60076 0 TCCTACTGAGAGAAACCAAAATATTGAGTAAACTATCCCACTTTGAACAGAATTTTTAAGAGAAAAAACTGAAAGTTAATAGAGAGGTGACTCAGATCCAG BBBFFFFFFFFFFIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFF<BBFFFFFFFFFFF ZA:Z:<@;0;;;0;;><&;27;;;1;101M;101> RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 HWI-D00450:67:H8AHNADXX:1:1111:11290:19503 101 20 60076 0 * = 60076 0 TCCTACTGAGAGAAACCAAAATATTGAGTAAACTATCCCACTTTGAACAGAATTTTTAAGAGAAAAAACTGAAAGTTAATAGAGAGGTGACTCAGATCCAG BBBFFFFFFFFFFIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFF ZA:Z:<@;0;;;0;;><&;27;;;1;101M;101> RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane1 HISEQ:63:H82C9ADXX:1:2215:2626:72342 153 20 60076 27 101M = 60076 0 AACAGGTGGTAAGGAAGGAGAGAGTGAAGGAACTGCCAGGTGACACACTCCCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATG FFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFBFFFFIIIIIIFFFBBIIIIFBIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;0;;;0;;><@;27;;;1;; MD:Z:101 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:0 HWI-D00450:67:H8AHNADXX:1:1111:11290:19503 1177 20 60076 27 101M = 60076 0 AACAGGTGGTAAGGAAGGAGAGAGTGAAGGAACTGCCAGGTGACACACTCCCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIIIIIIIIFFBIIIIFFIIIIIFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;0;;;0;;><@;27;;;1;; MD:Z:101 RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane1 NM:i:0 HISEQ:63:H82C9ADXX:1:1210:6887:65615 99 20 60760 33 101M = 61000 340 GTAACCTGTTTGTCAGCCACAACATCTTCCTAAGGGAGCCTTGTGTCCGGGAAAAACTGACAGACCAGTGATCTGGGTGCAGAAGGCTTGAGACAAAACTA BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:<@;33;;;1;;><&;32;;;1;101M;98C2> MD:Z:101 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:0 HISEQ:63:H82C9ADXX:1:1210:6887:65615 147 20 61000 32 101M = 60760 -340 TGCCATGCCCAGGATCCTCTGCCCTTGATCCTGAATCAACAGACCACTTGCAGATATACTTCACAGCCCACGCTGACTCTGCCAAGCACAGACAACCATTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIIIIIIIIFFFFFFIIIIIIIIIIFFIIIIFIIIIIFFIIIIIFIIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;33;;;1;101M;101><@;32;;;1;; MD:Z:98C2 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:1 HWI-D00450:67:H8AHNADXX:2:2112:6252:95594 73 20 61055 27 101M = 61055 0 ATACTTCACAGCCCACGCTGACTCTGCCAAGCACAGACAACCATTGGGCCCCAGGGGAGCTGCAGGTCTCCTGGTCACCTAATCTTTTTTTTTTTTATACT BBBFFFFFFFFFFIIIIFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:<@;27;;;1;;><&;0;;;0;;> MD:Z:43C57 RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane2 NM:i:1 HWI-D00450:67:H8AHNADXX:2:2112:6252:95594 133 20 61055 0 * = 61055 0 GTTTGATAGTTTGCTGAGAATGATGGTTTCCAGCTTCATCCATGTCCCTGCAAAGGGCATGAACTCATCATTTTTTATGGCTGCATAGTATTCCATGGCAC BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:&;27;;;1;101M;43C57><@;0;;;0;; RG:Z:118-1-2A.140128_H8AHNADXX_GATCAG.lane2 HISEQ:63:H82C9ADXX:1:1215:15890:18485 101 20 61443 0 * = 61443 0 AACAGGCCCTGGTGCCATGGAATACTATGCAGCCATAAAAAATGATGAGTTCATGCCCTTTGCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCA BBBFFFFFFFFFFFIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIFFIIFFIIIIIIIIIIIIIFFFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFF ZA:Z:<@;0;;;0;;><&;27;;;1;101M;101> RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 HISEQ:63:H82C9ADXX:1:1215:15890:18485 153 20 61443 27 101M = 61443 0 TCACTCATAGGTGGGAATTGAATAATGAGAACACATGGACACAGGAAGGGGAACATCACACATCGGGACCTAATCTTAAGCTAAGTGTGGCTAAGAGCCTA FBBBFFFFFFFFFFFFFFFFFFFFFFFFIIFFFIIIIFFFFIIIIFIIIFIIIIIFFBFFIFFIIIFFIFFIIFIFFIIIIIIIIIIIFFFFFFFFFFBBB ZA:Z:&;0;;;0;;><@;27;;;1;; MD:Z:101 RG:Z:118-1-2A.140113_H82C9ADXX_GATCAG.lane1 NM:i:0

mgymrek commented 9 years ago

I see. Even though the first two reads are unmapped they have a coordinate listed and get put at the top of the sorted bam file, which as you say probably differs between mapping software. According to the SAM specification an unmapped read "may also have an ordinary coordinate such that it can be placed at a desired position after sorting" which I was not aware of.

Anyway, so yes I think it makes sense to only look at the first "n" reads, where I was thinking n=~100. Sound good?

moonso commented 9 years ago

Nice to understand the problem! Yes I guess 100 reads should be enough.

mgymrek commented 9 years ago

ok, would you like to add it :) and then I'll merge the PR? Should just need to add a simple counter.

Thanks for pointing this issue out!

moonso commented 9 years ago

Sure, I'll fix it tomorrow

moonso commented 9 years ago

Fixed this in the latest commit:)

mgymrek commented 9 years ago

looks good, just merged. thanks!