tseemann / prokka

Rapid prokaryotic genome annotation

prokka-genbank_to_fasta_db error #494

Open ramadatta opened 4 years ago

ramadatta commented 4 years ago

Hi Seemann,

I downloaded a few plasmid sequences from NCBI in GenBank (full) format to use for plasmid annotation, and ran the following command:

prokka ENT2_Contig6_len_41186_circ_NDM-1_Plasmid.fasta --outdir PROKKA_05222020_with_AllNDM_plasmids --proteins All_NDM_Assigned_Plasmids_byMash_Plsdb.gb

However, I am getting the following error:

[20:53:55] Could not run command: prokka-genbank_to_fasta_db --format genbank All_NDM_Assigned_Plasmids_byMash_Plsdb\.gb > PROKKA_05222020_with_AllNDM_plasmids\/proteins\.faa 2> /dev/null

I ran the above command separately and found that the problem is at Feature 22 of plasmid MN657242:

[20] MN657242 | QHW09233.1 | Mobile element protein
    Using specified /transl_table=11
[21] MN657242 | QHW09234.1 | Mobile element protein
    Using specified /transl_table=11
[22] MN657242 | QHW09235.1 | Mercuric transport protein, MerT
    Using specified /transl_table=11

Feature #22 does not have any of these tags: protein_id locus_tag db_xref at /home/user/anaconda3/bin/prokka-genbank_to_fasta_db line 57, <> line 63736.

Could I request your help to overcome this? Thanks.

P.S: I am using prokka 1.14.6.

tseemann commented 4 years ago

Thanks for the report! It's strange - the merR CDS does not have a protein ID.

It's because it's a pseudogene, but instead of the /pseudo tag it uses a /pseudogene="" tag, possibly new, that I have not seen before!

https://www.ncbi.nlm.nih.gov/nuccore/MN657242

     CDS             58826..59308
                     /gene="merR"
                     /note="pR148AM_129; merR"
                     /pseudogene="unknown"
                     /codon_start=1
                     /transl_table=11
                     /product="Hg(II)-responsive transcriptional regulator
                     MerR"

tseemann commented 4 years ago

Looks like it is new: http://www.insdc.org/documents/feature_table.html

The confusing part is that the record is missing the /pseudo tag to go along with it?

Qualifier       /pseudo
Definition      indicates that this feature is a non-functional version of the
                element named by the feature key
Value format    none
Example         /pseudo
Comment         The qualifier /pseudo should be used to describe non-functional 
                genes that are not formally described as pseudogenes, e.g. CDS 
                has no translation due to other reasons than pseudogenisation events.
                Other reasons may include sequencing or assembly errors.
                In order to annotate pseudogenes the qualifier /pseudogene= must be
                used indicating the TYPE which can be taken from the INSDC controlled vocabulary 
                for pseudogenes.

Qualifier       /pseudogene=
Definition      indicates that this feature is a pseudogene of the element named
                by the feature key
Value format    "TYPE"
                where TYPE is one of the following:
                processed, unprocessed, unitary, allelic, unknown

Example         /pseudogene="processed"
                /pseudogene="unprocessed"
                /pseudogene="unitary"
                /pseudogene="allelic"
                /pseudogene="unknown"

Comment         TYPE is a term taken from the INSDC controlled vocabulary for pseudogenes
                (http://www.insdc.org/documents/pseudogene-qualifier-vocabulary):

                processed: the pseudogene has arisen by reverse transcription of a 
                mRNA into cDNA, followed by reintegration into the genome. Therefore,
                it has lost any intron/exon structure, and it might have a pseudo-polyA-tail.

                unprocessed: the pseudogene has arisen from a copy of the parent gene by duplication
                followed by accumulation of random mutations. The changes, compared to their
                functional homolog, include insertions, deletions, premature stop codons, frameshifts
                and a higher proportion of non-synonymous versus synonymous substitutions.

                unitary: the pseudogene has no parent. It is the original gene, which is
                functional in some species but disrupted in some way (indels, mutation, 
                recombination) in another species or strain.

                allelic: a (unitary) pseudogene that is stable in the population but
                importantly it has a functional alternative allele also in the population. i.e.,
                one strain may have the gene, another strain may have the pseudogene.
                MHC haplotypes have allelic pseudogenes.

                unknown: the submitter does not know the method of pseudogenisation.

ramadatta commented 4 years ago

Thanks so much for the quick reply, @tseemann.

I understand the problem now. For the time being, I removed the MN657242 plasmid from the gbk database and Prokka ran without any problem. Please advise if there is a fix that would let me include the MN657242 plasmid sequence. Thanks very much in advance!
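
Removing a record from a multi-record GenBank file, as described above, can be scripted. The following is only a sketch, assuming each record carries an ACCESSION line and ends with a line containing just `//`; the function names are hypothetical and are not part of Prokka.

```python
def split_records(text):
    """Yield each GenBank record, using a line that is exactly '//' as the terminator."""
    current = []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.rstrip() == "//":
            yield "".join(current)
            current = []
    # Keep any trailing, unterminated record rather than silently dropping it.
    if any(l.strip() for l in current):
        yield "".join(current)


def accession_of(record):
    """Return the first accession on the ACCESSION line, or None if absent."""
    for line in record.splitlines():
        if line.startswith("ACCESSION"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else None
    return None


def drop_records(text, unwanted):
    """Return the GenBank text without the records whose accession is in `unwanted`."""
    return "".join(r for r in split_records(text) if accession_of(r) not in unwanted)


# Example use (the output file name is illustrative):
# with open("All_NDM_Assigned_Plasmids_byMash_Plsdb.gb") as fh:
#     kept = drop_records(fh.read(), {"MN657242"})
# with open("filtered.gb", "w") as fh:
#     fh.write(kept)
```

Writing to a new file leaves the downloaded database untouched, so the dropped record can be restored later.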

tseemann commented 4 years ago

For now you can edit the GBK file and change /pseudogene="unknown" to /pseudo
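
For a large file, the edit suggested above can be automated. This is a minimal sketch, assuming the qualifier always sits on a line of its own (as in the merR record quoted earlier) and that every /pseudogene="TYPE" value should be rewritten, not only "unknown". The function name is hypothetical.

```python
import re

# An indented qualifier line such as:  /pseudogene="unknown"
# The lookahead tolerates Windows line endings without consuming them.
PSEUDOGENE = re.compile(
    r'^([ \t]+)/pseudogene="[^"\r\n]*"[ \t]*(?=\r?$)', re.MULTILINE
)


def pseudogene_to_pseudo(text):
    """Replace each /pseudogene="TYPE" qualifier line with /pseudo.

    Returns the new text and the number of lines that were changed.
    """
    return PSEUDOGENE.subn(r"\1/pseudo", text)


# Example use (the output file name is illustrative):
# with open("All_NDM_Assigned_Plasmids_byMash_Plsdb.gb") as fh:
#     fixed, n = pseudogene_to_pseudo(fh.read())
# with open("fixed.gb", "w") as fh:
#     fh.write(fixed)
# print(f"rewrote {n} qualifier line(s)")
```

A count of zero means the file has no such qualifier, in which case the failure has some other cause.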

ramadatta commented 4 years ago

@tseemann noted. Thank you!

D415yAPHA commented 2 years ago

Hi Dr Seemann,

I am experiencing the same issue, but cannot find the problematic tag. I ran:

prokka sample.fasta --proteins db/plasmids.gbk --outdir ./prokka --prefix sample

Obtained the same error as above: Could not run command: prokka-genbank_to_fasta_db --format genbank /prokka\/proteins.faa 2> /dev/null

Then ran /path/to/plasmids/gbks/db.gbk > prokka\/proteins.faa 2> /dev/null

and there was no output. My proteins.faa file is now blank. Is there any way I can check what is causing the error?
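
One way to see the underlying message is to run the converter on its own, as in the first post, but without sending its error output to /dev/null. The sketch below does this from Python. The helper names are hypothetical, and the command in the usage comment is copied from the error message in this thread.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def last_error_line(stderr):
    """Return the last non-empty line of stderr, usually the fatal message."""
    lines = [l for l in stderr.splitlines() if l.strip()]
    return lines[-1] if lines else ""


def explain_failure(argv):
    """Run the command; on a non-zero exit, print the last stderr line."""
    code, _, err = run_and_report(argv)
    if code != 0:
        print(last_error_line(err), file=sys.stderr)
    return code


# Example use, with the input path from the post above:
# explain_failure(["prokka-genbank_to_fasta_db", "--format", "genbank",
#                  "db/plasmids.gbk"])
```

In the first post the message obtained this way named the failing feature and the input line number, which is enough to locate the record in the GenBank file.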

Thank you, Daisy

valery-shap commented 2 years ago

Hello, I had the same issue. Changing /pseudogene="unknown" to /pseudo helped. Valery

Pepaflores56 commented 1 year ago

Hi, I have the same problem: "Could not run command: prokka-genbank_to_fasta_db". I checked the reference gbk file I am using for /pseudogene="unknown", but it is not there. I have uploaded the gbk file in case you can see what I cannot. Could you kindly help me with this, please? Dania

cluster_a.zip

jiaojiaoguan commented 1 year ago

Dear all:

I have the same problem: Could not run command: prokka-genbank_to_fasta_db --format genbank /prokka/proteins.faa 2> /dev/null

and there is no /pseudogene="unknown" in my GenBank file. My guess is that the file contains some weird characters that Prokka cannot support. So I found this command in the source code and changed prokka-genbank_to_fasta_db --format genbank All_NDM_Assigned_Plasmids_byMash_Plsdb\.gb > PROKKA_05222020_with_AllNDM_plasmids\/proteins\.faa 2> /dev/null to prokka-genbank_to_fasta_db --format genbank All_NDM_Assigned_Plasmids_byMash_Plsdb\.gb > PROKKA_05222020_with_AllNDM_plasmids\/proteins\.faa 2 (that is, without the trailing > /dev/null), so that the error output is no longer discarded. When I ran Prokka again, it printed which line has the error. Removing that record makes it work.

Thanks!

jiaojiaoguan commented 1 year ago

The error looks like this: Feature #1 does not have any of these tags: protein_id locus_tag db_xref at /xxxx/bin/prokka-genbank_to_fasta_db line 57, <> line 62075908.

I found that the locus_tag is missing in some sequences, so I removed them.
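
To find such features before running Prokka, the GenBank file can be scanned for CDS features that carry none of the three tags named in the error message. This is a rough sketch using only plain text parsing. It assumes the standard flat-file layout (feature keys at column 6, qualifiers at column 22) and, based on the /pseudo workaround earlier in this thread, that CDS features marked /pseudo are skipped by the converter. The function name is hypothetical.

```python
import re

ID_TAGS = ("protein_id", "locus_tag", "db_xref")
FEATURE_START = re.compile(r"^ {5}(\S+)\s+(\S.*)$")   # e.g. "     CDS   58826..59308"
QUALIFIER = re.compile(r"^ {21}/([A-Za-z_]+)")        # e.g. "   /locus_tag=..."


def cds_without_ids(text, skip_pseudo=True):
    """Return (accession, location) for each CDS that has none of ID_TAGS."""
    problems = []
    accession = None
    in_features = False
    feature = None  # the feature currently being read, or None

    def flush():
        # Decide whether the feature just finished should be reported.
        nonlocal feature
        if feature and feature["key"] == "CDS":
            tags = feature["tags"]
            has_id = bool(tags.intersection(ID_TAGS))
            skipped = skip_pseudo and "pseudo" in tags
            if not has_id and not skipped:
                problems.append((accession, feature["location"]))
        feature = None

    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else None
        elif line.startswith("FEATURES"):
            in_features = True
        elif in_features and line[:1].strip():
            # A new top-level keyword (ORIGIN, CONTIG, //) ends the feature table.
            flush()
            in_features = False
        elif in_features:
            start = FEATURE_START.match(line)
            if start:
                flush()
                feature = {"key": start.group(1),
                           "location": start.group(2).strip(),
                           "tags": set()}
            elif feature:
                qual = QUALIFIER.match(line)
                if qual:
                    feature["tags"].add(qual.group(1))
    flush()
    return problems
```

Only the first line of a wrapped location is reported, and a wrapped qualifier value whose continuation line happens to start with `/` would be misread as a qualifier, so the result is a list of candidates to inspect by hand.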