tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Add information about plasmid annotation #319

Open tseemann opened 6 years ago

tseemann commented 6 years ago

For plasmids, you will not get good results if you just use the default settings.

I would recommend getting GenBank files (.gbk or .gb) of all the plasmids that are similar to yours.

Say you get three of them: p1.gbk, p2.gbk and p3.gbk. Then make a single GenBank file:

cat p1.gbk p2.gbk p3.gbk > plasmids.gbk

Then run prokka with:

--proteins plasmids.gbk

That will give much better names for the proteins in your plasmid.
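The steps above could be wrapped up roughly as follows. This is only a sketch: the function names, the contigs file name `my_plasmid.fasta`, the output directory and the `plasmid` prefix are placeholders that do not come from this thread, and the `DRY_RUN` switch is just a convenience for checking the command before running it.

```shell
# Sketch only: function names and example file names are placeholders.

# Concatenate several GenBank files into one reference file.
# Usage: merge_genbank OUT.gbk IN1.gbk [IN2.gbk ...]
merge_genbank() {
    out=$1
    shift
    for f in "$@"; do
        # Fail early if an input file is missing or empty.
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
    cat "$@" > "$out"
}

# Annotate contigs using the merged GenBank file as a trusted protein source.
# With DRY_RUN=1 the command is printed instead of executed.
# Usage: annotate_plasmid REFERENCE.gbk CONTIGS.fasta OUTDIR
annotate_plasmid() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "prokka --proteins $1 --outdir $3 --prefix plasmid $2"
    else
        prokka --proteins "$1" --outdir "$3" --prefix plasmid "$2"
    fi
}

# Example (hypothetical file names):
#   merge_genbank plasmids.gbk p1.gbk p2.gbk p3.gbk
#   annotate_plasmid plasmids.gbk my_plasmid.fasta prokka_out
```

Checking the inputs before `cat` avoids silently building a reference file from a mistyped path.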

The next version of Prokka will have a proper plasmid database included.

sagarutturkar commented 5 years ago

Great suggestion! I was able to get improved results after using the genbank files.

I used PLSDB to get the plasmid sequences of interest. PLSDB information might be useful for others.

tseemann commented 5 years ago

@sagarutturkar thanks for the tip about PLSDB!

tseemann commented 5 years ago

Turns out there are 1.1 million unique proteins in all RefSeq plasmids. They cluster down to about 250,000. That's way bigger than the 22,000-protein core chromosomal DB I am using!

Kirk3gaard commented 5 years ago

How big a fraction of these still has no known function?

tseemann commented 5 years ago

That's after I excluded hypotheticals. BUT It turns out that those stats are all wrong, and include lots of chromosomes. WHY? https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/ has all the CDS of chromosomes in it too WTF!
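One possible way to screen out chromosomal records, sketched here under assumptions: it works on GenBank flat files (not on the protein FASTA files), and it assumes that genuine plasmid records carry a `/plasmid="..."` qualifier in their source feature while chromosomal records do not. The function name is hypothetical, and whether this fully removes the contamination described above is untested.

```shell
# Sketch: keep only GenBank records whose source feature has a /plasmid qualifier.
# Assumes each record ends with a line containing only "//".
# Usage: keep_plasmid_records < all_records.gbk > plasmid_only.gbk
keep_plasmid_records() {
    awk '
        { rec = rec $0 "\n" }                 # accumulate the current record
        /^ +\/plasmid=/ { is_plasmid = 1 }    # qualifier seen in this record
        /^\/\/$/ {                            # "//" terminates a record
            if (is_plasmid) printf "%s", rec
            rec = ""; is_plasmid = 0
        }
    '
}
```

The filtered file could then be passed to `--proteins` in the same way as the hand-picked GenBank files earlier in the thread.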

edfadeev commented 5 years ago

Hi @tseemann, Perhaps this would help as well? A Curated, Comprehensive Database of Plasmid Sequences

ABSTRACT Plasmid sequences are central to a myriad of microbial functions and processes. Here, we have compiled a database of complete plasmid sequences and associated metadata curated from both NCBI’s recent genome database update, which includes plasmids as organisms, and all available annotated bacterial genomes. The resultant database contains 10,892 complete plasmid sequences and associated metadata.

tseemann commented 5 years ago

I need a database of non-redundant plasmid-specific proteins and the corresponding /gene, /EC_number (and /COG if possible).
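For anyone assembling such a set by hand: as far as I understand the Prokka README, a protein FASTA passed to `--proteins` can carry these fields in the header as `>ID EC_number~~~gene~~~product~~~COG`. The sketch below converts a tab-separated table into that layout. The column order (id, EC number, gene, product, COG, sequence) and the function name are assumptions made for illustration.

```shell
# Sketch: turn a TSV into FASTA with Prokka-style annotated headers:
#   >ID EC_number~~~gene~~~product~~~COG
# Assumed input columns: id, EC number, gene, product, COG, sequence.
# Empty EC/gene/COG fields are kept as empty slots between the "~~~" separators.
# Usage: tsv_to_prokka_fasta < proteins.tsv > proteins.faa
tsv_to_prokka_fasta() {
    awk -F '\t' '
        NF >= 6 {
            printf ">%s %s~~~%s~~~%s~~~%s\n%s\n", $1, $2, $3, $4, $5, $6
        }
    '
}
```

Rows with fewer than six columns are skipped rather than written out with misaligned fields.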

katdotfasta commented 4 years ago

Hi @tseemann,

I am trying to minimize the number of hypothetical proteins in my genomes. Some genomes are complete (all replicons are closed) while others are not.

1) For closed genomes, I use --proteins with .gbk files of either chromosomes or plasmids, depending on what I am annotating (so each separately).
2) For draft genomes, I sometimes have a few closed replicons that are clearly plasmids (so I use --proteins with plasmid .gbk files), but at other times I do not. How would you advise I proceed?

I also used --prodigaltf [trained using prodigal -t on a closed genome].

Could you please weigh in on the approach I am taking? I also appreciate any advice that may help! Thanks and cheers, Kat
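For reference, the workflow described above could be scripted roughly like this. It is a sketch under assumptions: the function names, the `.trn` naming scheme and all file names are hypothetical, and it does not settle which reference set is the right one for draft genomes.

```shell
# Sketch: train Prodigal on a closed genome, then annotate with that training file.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Usage: annotate_with_training CLOSED.fna CONTIGS.fasta REFERENCE.gbk OUTDIR
annotate_with_training() {
    trn="${1%.*}.trn"    # training file named after the closed genome
    # prodigal -t writes the training file if it does not exist yet,
    # so the training step is skipped when the file is already present.
    [ -s "$trn" ] || run prodigal -i "$1" -t "$trn"
    run prokka --prodigaltf "$trn" --proteins "$3" --outdir "$4" "$2"
}

# Example (hypothetical file names):
#   annotate_with_training closed.fna draft.fasta plasmids.gbk prokka_out
```

Running it once with `DRY_RUN=1` shows the exact commands before anything is written to disk.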

splaisan commented 2 years ago

> That's after I excluded hypotheticals. BUT It turns out that those stats are all wrong, and include lots of chromosomes. WHY? https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/ has all the CDS of chromosomes in it too WTF!

Hi @tseemann, if I remember my classes correctly, plasmids are partly made up of bacterial genes. Could that be the reason? I just annotated my two natural plasmids using the bacterial settings, and it returned a number of ORFs, including known bacterial genes. What may be missing are replication regions and other regulatory elements, but at least the ORFs are there, right?