kdm9 / lpipy

LPI.py: Lineage Probability Index calculation in Python
GNU Lesser General Public License v3.0

Handling odd number of id|value pairs in sseqid string #3

Open Adamtaranto opened 8 years ago

Adamtaranto commented 8 years ago

Dies on subject sequence ID string with odd number of id|value pairs.

Why do we care about anything beyond the gi number? The database and accession (second pair) might come in handy, but we only really need the gi to do the taxid lookup.

Perhaps default action should be to store as many pairs as are present, and to skip any trailing values?

Traceback (most recent call last):
  File "./lpipy/misc/demo.py", line 15, in <module>
    for hit in BlastFile(argv[1]):
  File "/usr/local/lib/python3.4/dist-packages/lpi-0_untagged.11.gd0f1d6a-py3.4.egg/lpi/blast.py", line 128, in __next__
    values[i] = BLAST_FIELDS[field](values[i])
  File "/usr/local/lib/python3.4/dist-packages/lpi-0_untagged.11.gd0f1d6a-py3.4.egg/lpi/blast.py", line 18, in __init__
    raise ValueError("Odd number of 'id|value' pairs", seqid_str)
ValueError: ("Odd number of 'id|value' pairs", 'gi|59800337|sp|Q7ZT42.1|SND1_DANRE')

kdm9 commented 8 years ago

Yeah, came across this bug the other day. Planning to make two functions: one that scrapes just the GI to an int, and one that returns a dict for all pairs, silently accepting stupidity or format weirdness, so long as there's at least one valid pair.
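
A minimal sketch of what those two functions might look like. This is not the code in lpi/blast.py; the names `gi_from_seqid` and `pairs_from_seqid` are hypothetical, and the sketch assumes the seqid is a `|`-separated string in which an unpaired trailing field can simply be dropped.

```python
import re

# Matches a "gi|<digits>" pair at the start of the string or after a "|".
_GI_RE = re.compile(r"(?:^|\|)gi\|(\d+)")


def gi_from_seqid(seqid_str):
    """Return the GI number as an int; raise ValueError if there is none."""
    match = _GI_RE.search(seqid_str)
    if match is None:
        raise ValueError("No 'gi|<number>' pair found", seqid_str)
    return int(match.group(1))


def pairs_from_seqid(seqid_str):
    """Return a dict of id -> value, ignoring an unpaired trailing field."""
    fields = seqid_str.strip().split("|")
    pairs = {}
    # zip() stops at the shorter slice, so a leftover last field is skipped.
    for key, value in zip(fields[0::2], fields[1::2]):
        if key and value:  # skip empty ids or values, e.g. from "||"
            pairs[key] = value
    if not pairs:
        raise ValueError("No valid 'id|value' pair found", seqid_str)
    return pairs
```

With the string from the traceback, `gi_from_seqid('gi|59800337|sp|Q7ZT42.1|SND1_DANRE')` would give `59800337`, and `pairs_from_seqid` would keep the `gi` and `sp` pairs while dropping the trailing `SND1_DANRE`.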

kdm9 commented 8 years ago

I.e. expect the current bone-headed behaviour to change soon.

Adamtaranto commented 8 years ago

In the meantime, here is a painful way to clean up sseqid:

sed -e 's/\(gi|[0-9]\+\).*\(|\t\)/\1\2/' myblast.tab | sed -e 's/|\t/\t/' | sed -e 's/\([0-9]\)|[^\t]*/\1/' >> clean_gi.tab
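
For anyone who would rather avoid sed, the same cleanup can be sketched in Python. This assumes tab-separated BLAST output with sseqid in the second column (index 1), as in the default tabular layout; the function name `clean_lines` and the `column` parameter are made up for illustration and should be adjusted if the file uses a custom column order.

```python
import re

# A leading "gi|<digits>" in the sseqid field; anything after it is dropped.
GI_PREFIX = re.compile(r"gi\|\d+")


def clean_lines(lines, column=1):
    """Yield each line with the sseqid field trimmed down to 'gi|<number>'.

    Lines that are too short, or whose field does not start with
    'gi|<digits>', are passed through unchanged.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if column < len(fields):
            match = GI_PREFIX.match(fields[column])
            if match:
                fields[column] = match.group(0)
        yield "\t".join(fields) + "\n"
```

One way to use it would be to open myblast.tab for reading, open clean_gi.tab for writing, and pass `clean_lines(infile)` to `outfile.writelines()`.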