omnicoders / bio-geolocation

Looks up the location of sequences in GenBank and adds it to a FASTA file.

Punctuation characters in the name cause mangled output #6

Open AlanRockefeller opened 6 years ago

The following record has punctuation characters in its sequence name. They corrupt the output file: it ends up with two > header lines in a row instead of a > header line followed by a sequence, as the FASTA format requires.

https://www.ncbi.nlm.nih.gov/nuccore/MF140467.1

The output line looks like this:

MF140467.1 Coprinellus sp. Ireland: Gurraig :2282 internal transcribed spacer 1, partial sequence; 5.8 internal transcribed No Location Provided CCTGCGGAAGGATCATTAACGAATAACTATGGTGTCTTGGTTGTAGCTGGCTCCTCGGAGCATTGTGCACGCCCGCCATT
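
One way to catch this kind of corruption is to scan the output for a header line that directly follows another header line. The sketch below is only an illustration; `find_consecutive_headers` is a hypothetical helper, not part of bio-geolocation.

```python
def find_consecutive_headers(fasta_text):
    """Return 1-based line numbers of '>' lines that directly follow another '>' line.

    Hypothetical helper for illustration; it is not part of bio-geolocation.
    """
    problems = []
    previous_was_header = False
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines do not count as sequence data or headers
        is_header = line.startswith(">")
        if is_header and previous_was_header:
            problems.append(number)
        previous_was_header = is_header
    return problems
```

An empty result means every header found in the text is followed by at least one non-header line before the next header.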

Correct output should be:

MF140467.1 Coprinellus sp. Ireland: Gurraig CCTGCGGAAGGATCATTAACGAATAACTATGGTGTCTTGGTTGTAGCTGGCTCCTCGGAGCATTGTGCACGCCCGCCATT
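
A possible fix, sketched under an assumption: the issue does not say which characters break the header, so this guesses that a stray newline or > inside a field is what splits it into two header lines. `make_fasta_record` and its parameters are hypothetical names, not the tool's actual code.

```python
import re


def make_fasta_record(accession, organism, location, sequence):
    """Build one FASTA record whose header is guaranteed to be a single '>' line.

    Assumption: embedded newlines or '>' characters in a field are what
    mangle the output. All names here are hypothetical.
    """
    def clean(field):
        # Drop '>' and collapse any whitespace run (including newlines) to one space.
        field = field.replace(">", " ")
        return re.sub(r"\s+", " ", field).strip()

    parts = [clean(accession), clean(organism), clean(location)]
    header = " ".join(part for part in parts if part)
    # Sequence data must not contain whitespace or line breaks of its own here.
    residues = re.sub(r"\s+", "", sequence)
    return ">" + header + "\n" + residues + "\n"
```

Called with the accession, organism, and location from the record above, this yields one header line followed by the sequence. Other punctuation, such as the colon in the location, is left untouched.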