jaumlrc / ProphET

10 stars 10 forks source link

Corrupt FASTA File during install #34

Open Hana288 opened 5 years ago

Hana288 commented 5 years ago

While running ./INSTALL.pl I get the following out:

Formating sequences ...
BLAST Database creation error: Near line 171195, there's a line that doesn't look like plausible data, but it's not marked as defline or comment.
Program failed, try executing the command manually.
Removing ABC-Transporters ...
BLAST Database error: No alias or index file found for protein database [Phage_proteins_raw.db] in search path [/home/hana/Programs/PropET/ProphET_install_temp.dir::]
Program failed, try executing the command manually.
ERROR: Incomplete or corrupted BLAST search for ABC transporters!!

I believe the error is generated by BLAST while running the following command:

`formatdb -p T -i Phage_proteins_raw.db`;

This is the output of running that command manually:

Building a new DB, current time: 10/17/2018 13:50:13
New DB name:   /home/hana/Programs/PropET/ProphET_install_temp.dir/Phage_proteins_raw.db
New DB title:  Phage_proteins_raw.db
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B

volume: /home/hana/Programs/PropET/ProphET_install_temp.dir/Phage_proteins_raw.db

file: /home/hana/Programs/PropET/ProphET_install_temp.dir/Phage_proteins_raw.db.pin
file: /home/hana/Programs/PropET/ProphET_install_temp.dir/Phage_proteins_raw.db.phr
file: /home/hana/Programs/PropET/ProphET_install_temp.dir/Phage_proteins_raw.db.psq

BLAST Database creation error: Near line 171195, there's a line that doesn't look like plausible data, but it's not marked as defline or comment.
Program failed, try executing the command manually.

Looking at line 171195 of Phage_proteins_raw.db I see the following:

>NC_018083_17439_17615_1 - [CDS] (locus_tag="PHICPV4_gp27", product="hypothetical protein", protein_id="YP_006488642.1") Clostridium phage phiCPV4, complete genome.
MNDYVKNELWKLEVNNHILIDYVAQYVDNELTRCLLLQNLNLQGMIINNLQVEVLKNE
>NC_018083_17626_17775_1 - [CDS] (locus_tag="PHICPV4_gp28", product="hypothetical protein", protein_id="YP_006488643.1") Clostridium phage phiCPV4, complete genome.
MKTLIKLLAIIPITILGIILAGLVLIGACLFIVALPLMIAIALLISLYE
>- [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
    NC_017985_40126_40128_1
>NC_017985_1_1455_1 - [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
VKFNSNDRIFISIFLGLAIIYTFPLLTHQSFFVDDLGRSLYGGLGWSGNGRPLSDFIFYI
INFGTPIIDASPLPLMLGIVILALALSCVREKLFGDDYITASLCFMMILANPFFIENLSY
RYDSLTMCMSVAISIISSYVAYQYKPINIIISSILTIAFLSLYQAALNTYAIFLLAFIIS
DVVKKNSISNITKNTASSVAGLIIGYFSYSYFIAKRLVTGSYNIEHSKIIEINSSLFEGI

As you can see, there is a malformed FASTA sequence in the middle. This seems to originate from the following file:

fgrep -C5 -w NC_017985_40126_40128_1 Podoviridae.dir/all.prot.fas
>NC_018083_17439_17615_1 - [CDS] (locus_tag="PHICPV4_gp27", product="hypothetical protein", protein_id="YP_006488642.1") Clostridium phage phiCPV4, complete genome.
MNDYVKNELWKLEVNNHILIDYVAQYVDNELTRCLLLQNLNLQGMIINNLQVEVLKNE
>NC_018083_17626_17775_1 - [CDS] (locus_tag="PHICPV4_gp28", product="hypothetical protein", protein_id="YP_006488643.1") Clostridium phage phiCPV4, complete genome.
MKTLIKLLAIIPITILGIILAGLVLIGACLFIVALPLMIAIALLISLYE
>- [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
    NC_017985_40126_40128_1
>NC_017985_1_1455_1 - [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
VKFNSNDRIFISIFLGLAIIYTFPLLTHQSFFVDDLGRSLYGGLGWSGNGRPLSDFIFYI
INFGTPIIDASPLPLMLGIVILALALSCVREKLFGDDYITASLCFMMILANPFFIENLSY
RYDSLTMCMSVAISIISSYVAYQYKPINIIISSILTIAFLSLYQAALNTYAIFLLAFIIS
DVVKKNSISNITKNTASSVAGLIIGYFSYSYFIAKRLVTGSYNIEHSKIIEINSSLFEGI

However, I am unable to trace this further back to see where things are going wrong.

Hana288 commented 5 years ago

Ping @nathanhaigh

nathanhaigh commented 5 years ago

I've had a closer look into this and localised the issue further.

The Genbank file for Podoviridae (./ProphET_install_temp.dir/Podoviridae.dir/10744.gb) contains a locus_tag SPN9CC_0001 with multiple ranges for a feature:

FEATURES             Location/Qualifiers
     source          1..40128
                     /organism="Salmonella phage SPN9CC"
                     /mol_type="genomic DNA"
                     /host="Salmonella sp."
                     /db_xref="taxon:1127357"
     gene            complement(join(40126..40128,1..1455))
                     /gene="gtrC"
                     /locus_tag="SPN9CC_0001"
                     /db_xref="GeneID:12980303"
     CDS             complement(join(40126..40128,1..1455))
                     /gene="gtrC"
                     /locus_tag="SPN9CC_0001"
                     /note="GtrC"
                     /codon_start=1
                     /transl_table=11
                     /product="O-antigen conversion translocase"
                     /protein_id="YP_006383840.1"
                     /db_xref="GeneID:12980303"
                     /translation="MKFNSNDRIFISIFLGLAIIYTFPLLTHQSFFVDDLGRSLYGGL
                     GWSGNGRPLSDFIFYIINFGTPIIDASPLPLMLGIVILALALSCVREKLFGDDYITAS
                     LCFMMILANPFFIENLSYRYDSLTMCMSVAISIISSYVAYQYKPINIIISSILTIAFL
                     SLYQAALNTYAIFLLAFIISDVVKKNSISNITKNTASSVAGLIIGYFSYSYFIAKRLV
                     TGSYNIEHSKIIEINSSLFEGIISNVLSFYRMFSTILNGDNYLIYYSLFFALIISLIV
                     IVLKVIKRDENKKTKFLLVVLILLASMFFIIGPMIFLKSPIYAPRVLIGMGGFMFFCC
                     LCVFYAFEDKQLISRIYFSFILLISTIFSYGACNAINAQFQLEESIVNRISQDIDYLG
                     FGRDKKNIKFIGTEPYASINENIVIKHPLMRELIPRIINNNWMWSEVLMQRNVFSRNY
                     RLYDKEVKLENGWKKSGNNVYDIGVVGETIVVRFN"

The first range (40126..40128) corresponds to 3 bases which actually end up being a STOP codon. The output file (./ProphET_install_temp.dir/Podoviridae.dir/10744.antisense) generated by the call to extractfeat from within UTILS.dir/retrieve_proteins.sh is as follows:

>NC_017985_40126_40128 [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
tag
>NC_017985_1_1455 [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
gtgaaatttaatagtaatgacaggatatttatatcaatctttcttggattggcgattata
tatacatttcctttattgacacatcaatcatttttcgttgatgacttgggtaggtcttta
tatggcgggttgggttggtcaggcaatggtcgcccactttccgactttattttctatatc
attaattttggaaccccaattatagatgcttctccgctacctttaatgctagggatagtt
attttagcattggcactatcctgcgtcagggaaaagctgtttggagatgactacatcaca
gcatctctttgttttatgatgattttggcaaacccattctttattgaaaatctatcatat
agatatgattcattaacaatgtgcatgagtgtggcaatatctattatctcatcgtatgtc
gcttatcaatacaagcctataaatatcataatatcatccattttaaccattgcattcctt
agtctttatcaggctgcgctgaatacttacgcaatattcttgttggcctttataatttca
gatgtggttaagaaaaactcaatttcaaatatcacaaaaaatacagcatcttctgtcgct
ggtttaataataggatatttttcctattcttactttattgcaaaaagacttgtaacaggt
tcttacaatatcgaacatagtaagattatagagataaactcaagtttatttgaagggata
atttctaacgtcttatcattttatagaatgtttagcacgatcttgaatggcgataattac
ttaatctactactcgctattctttgcgctaatcatttctttgatagtcatagttttaaaa
gtaatcaaaagagatgaaaataagaaaacaaagttcttgctagtagttttaattttatta
gcatcaatgtttttcatcattggaccaatgatttttctaaaatcaccaatatacgcaccg
agggtattgattggtatgggtggctttatgtttttttgttgcctatgcgtattctatgct
tttgaagataagcagttaatatcaagaatatatttttcttttattcttttaatatcaaca
atattttcttatggtgcttgcaatgccataaatgcacagtttcagcttgaggaaagcatt
gtaaatagaatatctcaagacatagattatcttggatttggaagagacaagaaaaatata
aaattcattggcacagaaccgtatgcatcaataaatgaaaacatagtaataaagcatcct
ttaatgagagagttaataccacgcattattaacaataattggatgtggtcagaggtgtta
atgcaaagaaatgtgttctccagaaattacagactatatgacaaagaggtgaaacttgaa
aatgggtggaaaaaatctggtaataacgtatacgatattggtgttgtaggggaaaccata
gttgttaggtttaac

The conversion to protein by running transeq from UTILS.dir/retrieve_proteins.sh creates the following output:

>NC_017985_40126_40128_1 - [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
*
>NC_017985_1_1455_1 - [CDS] (locus_tag="SPN9CC_0001", product="O-antigen conversion translocase", protein_id="YP_006383840.1") Salmonella phage SPN9CC, complete genome.
VKFNSNDRIFISIFLGLAIIYTFPLLTHQSFFVDDLGRSLYGGLGWSGNGRPLSDFIFYI
INFGTPIIDASPLPLMLGIVILALALSCVREKLFGDDYITASLCFMMILANPFFIENLSY
RYDSLTMCMSVAISIISSYVAYQYKPINIIISSILTIAFLSLYQAALNTYAIFLLAFIIS
DVVKKNSISNITKNTASSVAGLIIGYFSYSYFIAKRLVTGSYNIEHSKIIEINSSLFEGI
ISNVLSFYRMFSTILNGDNYLIYYSLFFALIISLIVIVLKVIKRDENKKTKFLLVVLILL
ASMFFIIGPMIFLKSPIYAPRVLIGMGGFMFFCCLCVFYAFEDKQLISRIYFSFILLIST
IFSYGACNAINAQFQLEESIVNRISQDIDYLGFGRDKKNIKFIGTEPYASINENIVIKHP
LMRELIPRIINNNWMWSEVLMQRNVFSRNYRLYDKEVKLENGWKKSGNNVYDIGVVGETI
VVRFN

This means that when STOP codons are removed, the output becomes corrupted FASTA sequence due to UTILS.dir/line2fasta not being able to handle sequences of length zero.

I thus have 2 questions:

  1. Should calls to EMBOSS' extractfeat be using the -join argument so as to create a sequence which is the product of possibly multiple ranges as in the above case. This would result in a single protein sequence for each CDS feature in the GenBank file, regardless of whether it is defined as a single range or multiple ranges.
  2. The regexp used in UTILS.dir/line2fasta could be modified to allow zero length sequences, and then only print sequences if they are >0bp long:
diff --git a/UTILS.dir/line2fasta b/UTILS.dir/line2fasta
index 60d4f49..029ef27 100755
--- a/UTILS.dir/line2fasta
+++ b/UTILS.dir/line2fasta
@@ -19,9 +19,9 @@ if( defined( $inputFile ) ){

   while( <INFILE> ){
        chomp;
-    my ( $seq, $id ) = ( $_ =~ /([\w\W]+?)[\s\t]+([\w\W]+)/ );
+    my ( $seq, $id ) = ( $_ =~ /([\w\W]*?)[\s\t]+([\w\W]+)/ );
     $seq =~ s/(\w{60})/$1\n/g;
-    print ">$id\n$seq\n";
+    print ">$id\n$seq\n" if length($seq) > 0;
   }

   close INFILE;
@@ -30,9 +30,9 @@ if( defined( $inputFile ) ){

   while( <> ){
     chomp;
-    my ( $seq, $id ) = ( $_ =~ /([\w\W]+?)[\s\t]+([\w\W]+)/ );
+    my ( $seq, $id ) = ( $_ =~ /([\w\W]*?)[\s\t]+([\w\W]+)/ );
     $seq =~ s/(\w{60})/$1\n/g;
-    print ">$id\n$seq\n";
+    print ">$id\n$seq\n" if length($seq) > 0;
   }
 }

I'd argue, the above patch should be used, and then a decision about whether -join is used when extractfeat is called needs to be had.

nathanhaigh commented 5 years ago

ping @gustavo11