Open Hana288 opened 5 years ago
Ping @nathanhaigh
I've had a closer look into this and localised the issue further.
The GenBank file for Podoviridae (./ProphET_install_temp.dir/Podoviridae.dir/10744.gb) contains a locus_tag SPN9CC_0001 with multiple ranges for a feature:
FEATURES Location/Qualifiers
source 1..40128
/organism="Salmonella phage SPN9CC"
/mol_type="genomic DNA"
/host="Salmonella sp."
/db_xref="taxon:1127357"
gene complement(join(40126..40128,1..1455))
/gene="gtrC"
/locus_tag="SPN9CC_0001"
/db_xref="GeneID:12980303"
CDS complement(join(40126..40128,1..1455))
/gene="gtrC"
/locus_tag="SPN9CC_0001"
/note="GtrC"
/codon_start=1
/transl_table=11
/product="O-antigen conversion translocase"
/protein_id="YP_006383840.1"
/db_xref="GeneID:12980303"
/translation="MKFNSNDRIFISIFLGLAIIYTFPLLTHQSFFVDDLGRSLYGGL
GWSGNGRPLSDFIFYIINFGTPIIDASPLPLMLGIVILALALSCVREKLFGDDYITAS
LCFMMILANPFFIENLSYRYDSLTMCMSVAISIISSYVAYQYKPINIIISSILTIAFL
SLYQAALNTYAIFLLAFIISDVVKKNSISNITKNTASSVAGLIIGYFSYSYFIAKRLV
TGSYNIEHSKIIEINSSLFEGIISNVLSFYRMFSTILNGDNYLIYYSLFFALIISLIV
IVLKVIKRDENKKTKFLLVVLILLASMFFIIGPMIFLKSPIYAPRVLIGMGGFMFFCC
LCVFYAFEDKQLISRIYFSFILLISTIFSYGACNAINAQFQLEESIVNRISQDIDYLG
FGRDKKNIKFIGTEPYASINENIVIKHPLMRELIPRIINNNWMWSEVLMQRNVFSRNY
RLYDKEVKLENGWKKSGNNVYDIGVVGETIVVRFN"
The first range (40126..40128) corresponds to 3 bases which actually end up being a STOP codon. The output file (./ProphET_install_temp.dir/Podoviridae.dir/10744.antisense) generated by the call to extractfeat from within UTILS.dir/retrieve_proteins.sh is as follows:
>NC_017985_40126_40128 [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
tag
>NC_017985_1_1455 [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
gtgaaatttaatagtaatgacaggatatttatatcaatctttcttggattggcgattata
tatacatttcctttattgacacatcaatcatttttcgttgatgacttgggtaggtcttta
tatggcgggttgggttggtcaggcaatggtcgcccactttccgactttattttctatatc
attaattttggaaccccaattatagatgcttctccgctacctttaatgctagggatagtt
attttagcattggcactatcctgcgtcagggaaaagctgtttggagatgactacatcaca
gcatctctttgttttatgatgattttggcaaacccattctttattgaaaatctatcatat
agatatgattcattaacaatgtgcatgagtgtggcaatatctattatctcatcgtatgtc
gcttatcaatacaagcctataaatatcataatatcatccattttaaccattgcattcctt
agtctttatcaggctgcgctgaatacttacgcaatattcttgttggcctttataatttca
gatgtggttaagaaaaactcaatttcaaatatcacaaaaaatacagcatcttctgtcgct
ggtttaataataggatatttttcctattcttactttattgcaaaaagacttgtaacaggt
tcttacaatatcgaacatagtaagattatagagataaactcaagtttatttgaagggata
atttctaacgtcttatcattttatagaatgtttagcacgatcttgaatggcgataattac
ttaatctactactcgctattctttgcgctaatcatttctttgatagtcatagttttaaaa
gtaatcaaaagagatgaaaataagaaaacaaagttcttgctagtagttttaattttatta
gcatcaatgtttttcatcattggaccaatgatttttctaaaatcaccaatatacgcaccg
agggtattgattggtatgggtggctttatgtttttttgttgcctatgcgtattctatgct
tttgaagataagcagttaatatcaagaatatatttttcttttattcttttaatatcaaca
atattttcttatggtgcttgcaatgccataaatgcacagtttcagcttgaggaaagcatt
gtaaatagaatatctcaagacatagattatcttggatttggaagagacaagaaaaatata
aaattcattggcacagaaccgtatgcatcaataaatgaaaacatagtaataaagcatcct
ttaatgagagagttaataccacgcattattaacaataattggatgtggtcagaggtgtta
atgcaaagaaatgtgttctccagaaattacagactatatgacaaagaggtgaaacttgaa
aatgggtggaaaaaatctggtaataacgtatacgatattggtgttgtaggggaaaccata
gttgttaggtttaac
The conversion to protein by running transeq from UTILS.dir/retrieve_proteins.sh creates the following output:
>NC_017985_40126_40128_1 - [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
*
>NC_017985_1_1455_1 - [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
VKFNSNDRIFISIFLGLAIIYTFPLLTHQSFFVDDLGRSLYGGLGWSGNGRPLSDFIFYI
INFGTPIIDASPLPLMLGIVILALALSCVREKLFGDDYITASLCFMMILANPFFIENLSY
RYDSLTMCMSVAISIISSYVAYQYKPINIIISSILTIAFLSLYQAALNTYAIFLLAFIIS
DVVKKNSISNITKNTASSVAGLIIGYFSYSYFIAKRLVTGSYNIEHSKIIEINSSLFEGI
ISNVLSFYRMFSTILNGDNYLIYYSLFFALIISLIVIVLKVIKRDENKKTKFLLVVLILL
ASMFFIIGPMIFLKSPIYAPRVLIGMGGFMFFCCLCVFYAFEDKQLISRIYFSFILLIST
IFSYGACNAINAQFQLEESIVNRISQDIDYLGFGRDKKNIKFIGTEPYASINENIVIKHP
LMRELIPRIINNNWMWSEVLMQRNVFSRNYRLYDKEVKLENGWKKSGNNVYDIGVVGETI
VVRFN
This means that when STOP codons are removed, the first record is left with a zero-length sequence, and the output becomes a corrupted FASTA file because UTILS.dir/line2fasta cannot handle sequences of length zero.
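To make the failure mode concrete, here is a minimal Python sketch (not ProphET code) of what happens to the two records above once stop codons are stripped. The helper name `strip_stops` is hypothetical, and the headers and second sequence are abbreviated from the transeq output shown earlier.

```python
def strip_stops(fasta_text):
    """Parse FASTA text and drop every '*' from each sequence.

    Returns a list of (header, sequence) tuples in input order.
    """
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).replace("*", "")))
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks).replace("*", "")))
    return records


# Abbreviated stand-in for the transeq output above (descriptions trimmed,
# second sequence truncated for illustration).
transeq_output = """>NC_017985_40126_40128_1
*
>NC_017985_1_1455_1
VKFNSNDRIF
"""

records = strip_stops(transeq_output)
# Records whose sequence is empty after stop removal.
empty = [header for header, seq in records if not seq]
print(empty)
```

The record for the 40126..40128 range ends up with an empty sequence, which is the case any downstream step has to tolerate.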
I thus have 2 questions:
1. Should extractfeat be using the -join argument so as to create a sequence which is the product of possibly multiple ranges, as in the above case? This would result in a single protein sequence for each CDS feature in the GenBank file, regardless of whether it is defined as a single range or multiple ranges.
2. UTILS.dir/line2fasta could be modified to allow zero-length sequences, and then only print sequences if they are >0bp long:
diff --git a/UTILS.dir/line2fasta b/UTILS.dir/line2fasta
index 60d4f49..029ef27 100755
--- a/UTILS.dir/line2fasta
+++ b/UTILS.dir/line2fasta
@@ -19,9 +19,9 @@ if( defined( $inputFile ) ){
while( <INFILE> ){
chomp;
- my ( $seq, $id ) = ( $_ =~ /([\w\W]+?)[\s\t]+([\w\W]+)/ );
+ my ( $seq, $id ) = ( $_ =~ /([\w\W]*?)[\s\t]+([\w\W]+)/ );
$seq =~ s/(\w{60})/$1\n/g;
- print ">$id\n$seq\n";
+ print ">$id\n$seq\n" if length($seq) > 0;
}
close INFILE;
@@ -30,9 +30,9 @@ if( defined( $inputFile ) ){
while( <> ){
chomp;
- my ( $seq, $id ) = ( $_ =~ /([\w\W]+?)[\s\t]+([\w\W]+)/ );
+ my ( $seq, $id ) = ( $_ =~ /([\w\W]*?)[\s\t]+([\w\W]+)/ );
$seq =~ s/(\w{60})/$1\n/g;
- print ">$id\n$seq\n";
+ print ">$id\n$seq\n" if length($seq) > 0;
}
}
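For anyone who wants to poke at the regex change outside of Perl, here is a rough Python re-expression of the two patterns. It assumes each input line is "sequence, whitespace, id", which is what the regex implies, and that an empty sequence shows up as a line beginning with whitespace; I have not confirmed the exact intermediate format. The names `split_line` and `to_fasta` are made up for this sketch.

```python
import re

# Same patterns as in the diff: '+?' requires at least one sequence
# character, '*?' also accepts an empty sequence.
OLD = re.compile(r"([\w\W]+?)[\s\t]+([\w\W]+)")
NEW = re.compile(r"([\w\W]*?)[\s\t]+([\w\W]+)")


def split_line(pattern, line):
    """Return (seq, id) for one input line, or None if it does not match."""
    m = pattern.search(line)
    return (m.group(1), m.group(2)) if m else None


def to_fasta(lines, pattern=NEW, width=60):
    """Convert 'seq<whitespace>id' lines to FASTA, skipping empty sequences."""
    out = []
    for line in lines:
        parsed = split_line(pattern, line.rstrip("\n"))
        if parsed is None:
            continue
        seq, seq_id = parsed
        if not seq:
            continue  # mirrors the 'if length($seq) > 0' guard in the patch
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        out.append(f">{seq_id}\n{wrapped}")
    return "\n".join(out)


# Hypothetical lines: the first has an empty sequence, the second is normal.
lines = ["\tID_A desc", "MKF\tID_B desc"]
old_result = split_line(OLD, lines[0])
new_result = split_line(NEW, lines[0])
fasta = to_fasta(lines)
print(old_result, new_result)
print(fasta)
```

Under these assumptions the old pattern swallows the id into the sequence slot and treats the rest of the header as the id, while the new pattern yields an empty sequence that the length guard can then skip.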
I'd argue the above patch should be used, and then a decision needs to be made about whether -join is used when extractfeat is called.
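To illustrate what is at stake in the -join decision, here is a toy Python sketch of per-range versus joined extraction for a complement(join(...)) location on a circular genome. It follows the usual GenBank location semantics and is not a test of what extractfeat actually emits with -join. The 12-base genome and the `extract` helper are invented for illustration.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(genome, ranges, complement=False, join=True):
    """Extract a feature given 1-based inclusive (start, end) ranges.

    With join=True the ranges are concatenated first and returned as one
    sequence; with join=False each range is returned separately.
    """
    parts = [genome[start - 1:end] for start, end in ranges]
    if join:
        seq = "".join(parts)
        return [reverse_complement(seq) if complement else seq]
    return [reverse_complement(p) if complement else p for p in parts]


# Toy circular genome (hypothetical), with a feature analogous to
# complement(join(40126..40128,1..1455)): complement(join(10..12,1..6)).
genome = "tttcatgggcta"
ranges = [(10, 12), (1, 6)]

separate = extract(genome, ranges, complement=True, join=False)
joined = extract(genome, ranges, complement=True, join=True)
print(separate)
print(joined)
```

In this toy case the per-range output contains a lone stop codon as its own record, as in the issue, whereas the joined output is a single coding sequence that ends with the stop codon.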
ping @gustavo11
While running ./INSTALL.pl I get the following output:
I believe the error is generated by BLAST while running the following command:
This is the output of running that command manually:
Looking at line 171195 of Phage_proteins_raw.db I see the following:
As you can see, there is a malformed FASTA sequence in the middle. This seems to originate from the following file:
However, I am unable to trace this further back to see where things are going wrong.