sabrinadeltoro / pygr

Automatically exported from code.google.com/p/pygr

TBLASTN parsing error #41

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Bug in BLAST parser - from a git checkout on September 15th 2008                

Background:

In order to get around a bug in tblastn, the parser sets the start position of
the sequence on the subject line to the start position of the sequence on the
query line above it.

However, if the sequence on the query line is just the single residue Q, then
the find matches the Q of "Query:" and stores 0 as the offset of the start of
the subject sequence. See the last lines of the result below, plus the traceback:

>ref|NC_007503.1| Carboxydothermus hydrogenoformans Z-2901, complete genome
          Length = 2401520                                                 

 Score = 90.5 bits (223), Expect = 4e-17,   Method: Compositional matrix adjust.
 Identities = 53/181 (29%), Positives = 97/181 (53%), Gaps = 12/181 (6%)
 Frame = +2

Query: 13    GKVLWQNLTFTISAGERVGIHAPSGTGKTTLGRVLAGWQKPTAGDVLLDGSPFPLHQYCP 72
             G+V+   +TFT+  G+ +G+  PSG GK++L R+L     PT+G++   G    + +Y P
Sbjct: 99509 GQVILDGITFTVEEGDFLGVLGPSGAGKSSLFRLLNRLLSPTSGEIYYRGK--NIKEYDP 99682

Query: 73    VQLVPQHPELTFNPWRSAGDAVRD--------AWQPDPETLRRL----HVQPEWLTRRPM 120
             ++L  +   +   P+      + D          +PD E + +     +++ E L ++P
Sbjct: 99683 IKLRREIGYVLQRPYLFGQKVLEDLTYPFRIRQEKPDMELIYKYLAQANLKEEILAKKPT 99862

Query: 121   QLSGGELARIAILRALDPRTRFLIADEMTAQLDPSIQKAIWVYVLEVCRSRSLGMLVISH 180
             +LSGGE  RI+++R L  + R L+ DE+T+ LD    +AI   +L+    ++L +L I+H
Sbjct: 99863 ELSGGEAQRISLIRTLLVQPRVLLLDEVTSALDLDTTRAILDLILKEKEEKNLTVLAITH 100042

Query: 181   Q 181

Sbjct: 100043N 100045

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/pygr/parse_blast.py", line 174, in <module>
    for t in p.parse_file(sys.stdin):
  File "/usr/lib/python2.5/site-packages/pygr/parse_blast.py", line 169, in parse_file
    self.save_subject_line(line)
  File "/usr/lib/python2.5/site-packages/pygr/parse_blast.py", line 80, in save_subject_line
    self.subject_end=int(c[3])
ValueError: invalid literal for int() with base 10: '100043N'

Possible Bugfix/workaround:

Line 70 in parse_blast.py

currently:
        self.seq_start_char=line.find(c[2]) # IN CASE BLAST SCREWS UP Sbjct:

could be:
        self.seq_start_char=line[1:].find(c[2])+1 # IN CASE BLAST SCREWS UP Sbjct:

That is, only search from the second character onward, to avoid matching the Q
of "Query:".
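
The failure and the proposed workaround can be reproduced in isolation. The sketch below is not pygr code: the helper names are hypothetical, and the query line is rebuilt from the BLAST output above assuming the usual column alignment.

```python
def naive_offset(line):
    """Mimic the current behaviour: search the whole line for the sequence field."""
    fields = line.split()          # e.g. ['Query:', '181', 'Q', '181']
    return line.find(fields[2])


def offset_skipping_first_char(line):
    """Mimic the proposed workaround: ignore the first character of the line."""
    fields = line.split()
    return line[1:].find(fields[2]) + 1


# Hypothetical reconstruction of the last query line in the report.
query_line = "Query: 181   Q 181"

print(naive_offset(query_line))                # 0, the Q of "Query:"
print(offset_skipping_first_char(query_line))  # 13, the real sequence column
```

With an offset of 0 the parser has no usable column for splitting the following "Sbjct:" line, which is consistent with the ValueError in the traceback.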

Original issue reported on code.google.com by fishfrog...@gmail.com on 22 Sep 2008 at 11:48

GoogleCodeExporter commented 9 years ago
Thanks for the fix!  I modified it slightly to take advantage of the optional
start argument of the standard library's string find, i.e. find(q, start).
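
For illustration, a guard written with the start argument of str.find might look like the sketch below. The function name and the choice of start offset are assumptions; this thread does not show the value used in the actual commit.

```python
QUERY_LABEL = "Query:"


def seq_start_offset(line):
    """Return the column where the sequence field begins on a Query: line."""
    fields = line.split()
    # Begin the search after the label so no letter of "Query:" can match.
    return line.find(fields[2], len(QUERY_LABEL))


print(seq_start_offset("Query: 181   Q 181"))  # 13
```

Passing a start index avoids building the temporary slice line[1:] and the +1 correction, and the returned index is already relative to the full line.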

Original comment by cjlee...@gmail.com on 27 Sep 2008 at 12:29

GoogleCodeExporter commented 9 years ago

Original comment by mare...@gmail.com on 21 Feb 2009 at 2:05

GoogleCodeExporter commented 9 years ago
We're closing this issue; the original file that reproduces the bug is included
in the blast_test.py automated tests.  Please re-open if you think there are
still problems.

Original comment by cjlee...@gmail.com on 4 Mar 2009 at 11:59