psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

IndexError: string index out of range #234

Closed jotwin closed 7 years ago

jotwin commented 7 years ago

I was using this script to extract CDR3 positions and got an error:

import csv
import sys

csv.field_size_limit(sys.maxsize)

partis_path = '/home/jakub/source/partis'
sys.path.insert(1, partis_path + '/python')
import utils
import glutils

glfo = glutils.read_glfo(partis_path + '/data/germlines/human', locus='igh')
print("cyst_position, tryp_position")
with open(sys.argv[1]) as csvfile:
    reader = csv.DictReader(csvfile)
    for line in reader:
        utils.process_input_line(line)
        try:
            #utils.add_implicit_info(glfo, line, existing_implicit_keys=('aligned_d_seqs', 'aligned_j_seqs', 'aligned_v_seqs', 'cdr3_length', 'naive_seq', 'in_frames', 'mutated_invariants', 'stops', 'mut_freqs'))
            utils.add_implicit_info(glfo, line)
            cdr3_bounds = (line['codon_positions']['v'], line['codon_positions']['j'])
            print("%s, %s " % cdr3_bounds)
            #print line['unique_ids'], "\t", line['codon_positions']['v'], "\t", line['codon_positions']['j']
        except Exception:  # any failure while adding implicit info is reported as NA
            print("NA, NA")
(lots of output)
46, 118 
46, 118 
57, 118 
65, 118 
46, 118 
84, 145 
62, 118 
79, 118 
49, 118 
52, 118 
52, 118 
Traceback (most recent call last):
  File "/home/jakub/Dropbox/coevolution/HIV_AB_Sequences/Pybus/scripts/getcdr3_frompartis.py", line 16, in <module>
    utils.process_input_line(line)
  File "/home/jakub/source/partis/python/utils.py", line 1370, in process_input_line
    info['seqs'] = [info['indel_reversed_seqs'][iseq] if info['indel_reversed_seqs'][iseq] != '' else info['input_seqs'][iseq] for iseq in range(len(info['unique_ids']))]  # if there's no indels, we just store 'input_seqs' and leave 'indel_reversed_seqs' empty
IndexError: string index out of range
psathyrella commented 7 years ago

Hm, well, that certainly shouldn't happen. That bit of code is kind of complicated because it handles backwards compatibility with several different file vintages. Either way, it looks like it's finding something unexpected in either the indel_reversed_seqs or the input_seqs column of the file. What do those columns look like in the file? Is there any chance you modified them? If not, can you send the file?

I've also added a check on the dev branch here:

https://github.com/psathyrella/partis/commit/0b14db1fbca4f963b43f1542a5156b24cd9c38c8

that may give some clues if the other things don't work.
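
For reference, here is a minimal sketch of one way to get this exact error message. It assumes (this is a guess, not something confirmed in the thread) that the column reaches the indexing step as an empty string rather than as a list of sequences. The helper name `first_seq` is made up for illustration and is not part of partis.

```python
def first_seq(column_value):
    """Return the entry for the first sequence, as the list comprehension in the traceback does."""
    return column_value[0]

# A properly parsed column is a list with one entry per sequence.
print(first_seq(['ACGT']))

# Hypothetical failure case: the column is still a bare empty string.
try:
    first_seq('')
except IndexError as err:
    message = str(err)

print(message)
```

Indexing an empty string raises the same IndexError that appears at the end of the traceback, which is consistent with a blank column in the csv.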

jotwin commented 7 years ago

The input file was a partis annotation csv from the latest version. At the end of that file were some empty lines (only the sequence ID filled in), which caused the error. Is that the intended behavior of partis annotate?

example

D2-M_1849061_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_82001_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_3195351_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_201371_5,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_3784141_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_1553101_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_4342851_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_3762521_10,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_3622941_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_201161_5,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_4135321_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_3101581_6,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_1712191_11,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_733181_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_4015501_1,,,,,,,,,,,,,,,,,,,,,,,,,,
D2-M_3318991_1,,,,,,,,,,,,,,,,,,,,,,,,,,
psathyrella commented 7 years ago

oh, cool, that's easy to fix then. I'll deal and post.

Yes, that's what it does when it can't find an annotation. There should be some statistics in stdout on how many sequences failed smith-waterman and hmm annotation. If you want more details about why those failed, run just on them with debug cranked up to 1 or 2, i.e. basically the same command line you ran before, but with something like:

./bin/partis run-viterbi --infname <yadda> --queries=D2-M_1849061_1:D2-M_82001_1:D2-M_3195351_1 --debug 1
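
If there are a lot of failed queries, the colon-separated list can be built from the csv rather than typed by hand. This is a rough sketch under a couple of assumptions: the ID column is called `unique_ids` (as in the script above), and a failed row is one where every other column is empty, like the example rows earlier in the thread. The function names and the toy header are made up for illustration.

```python
import csv
import io

def failed_query_ids(csvfile, id_column='unique_ids'):
    """Collect IDs of rows in which every column other than the ID is empty."""
    failed = []
    for row in csv.DictReader(csvfile):
        others = [value for key, value in row.items() if key != id_column]
        if all(value in ('', None) for value in others):
            failed.append(row[id_column])
    return failed

def queries_argument(ids):
    """Join IDs with colons, matching the --queries format shown above."""
    return '--queries=' + ':'.join(ids)

# Toy input with a shortened, assumed header; real partis output has many more columns.
example = io.StringIO(
    "unique_ids,v_gene,input_seqs\n"
    "some-good-seq,IGHV1-2*02,ACGT\n"
    "D2-M_1849061_1,,\n"
    "D2-M_82001_1,,\n"
)
arg = queries_argument(failed_query_ids(example))
print(arg)
```

The resulting string can be pasted into the run-viterbi command in place of the hand-written `--queries` argument.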
psathyrella commented 7 years ago

oh, cool, that's ~easy~ kind of a cluster*#$($ to fix then.

almost done...

psathyrella commented 7 years ago

ok in retrospect I should have just told you that the example script should've had a line to skip failed annotations like so:

https://github.com/psathyrella/partis/commit/dcb728b6bbd0bbeb47b74bb6ea982c23c2c0c425#diff-6091bf930a34810278d242867095447eR18

but the bookkeeping on failed annotations needed to be fixed, which it hopefully is now. For instance, queries that failed in smith-waterman weren't getting written to the output file; now the input sequences are written to the output file for failed queries.

I should add that I decided it made more sense to write failed queries to the output file as empty-ish lines, to make it clear that they failed and so they don't just disappear. But if this isn't the best way to do it let me know.
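
For anyone reading along without clicking through, skipping failed annotations when reading the csv can look roughly like the sketch below. It uses an empty `v_gene` column as the sign of a failed query, which is an assumption here rather than a quote from the linked commit. The generator name `successful_annotations` is made up for illustration.

```python
import csv
import io

def successful_annotations(csvfile, marker_column='v_gene'):
    """Yield only rows that have an annotation, skipping failed (empty-ish) ones."""
    for row in csv.DictReader(csvfile):
        if not row.get(marker_column):  # empty or missing: treat the query as failed
            continue
        yield row

# Toy input with an assumed, shortened header; not real partis output.
example = io.StringIO(
    "unique_ids,v_gene,cdr3_length\n"
    "seq-a,IGHV1-2*02,45\n"
    "D2-M_1849061_1,,\n"
)
kept = [row['unique_ids'] for row in successful_annotations(example)]
print(kept)
```

In the original script, the equivalent check would go at the top of the `for line in reader:` loop, before the call to `utils.process_input_line(line)`, since that call is where the traceback was raised.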