Open Adamtaranto opened 8 years ago
Yeah, came across this bug the other day. Planning to make two functions, one that scrapes just the GI to an int, and one that returns a dict for all pairs, silently accepting stupidity or format weridness, so long as there's at least one vaild pair.
I.e. expect the current bone-headed behaviour to change soon.
In the mean time here is a painful way to clean up sseqid:
sed -e 's/\(gi|[0-9]\+\).*\(|\t\)/\1\2/' myblast.tab | sed -e 's/|\t/\t/' | sed -e 's/[0-9]|[^\t]*//' >> clean_gi.tab
Dies on subject sequence ID string with odd number of id|value pairs.
Why do we care about anything beyond the gi number? The database and accession (second pair) might come in handy, but only really need gi to do taxid lookup.
Perhaps default action should be to store as many pairs as are present, and to skip any trailing values?