Open tseemann opened 6 years ago
Great suggestion! I was able to get improved results after using the genbank files.
I used PLSDB to get the plasmid sequences of interest. PLSDB information might be useful for others.
@sagarutturkar thanks for the tip about PLSDB!
Turns out there are 1.1 million unique proteins in all RefSeq plasmids, which clustered down to about 250,000. That's way bigger than the 22,000-protein core chromosomal DB I am using!
How big a fraction of these still has no known function?
That's after I excluded hypotheticals. But it turns out that those stats are all wrong and include lots of chromosomes. Why? https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/ has all the CDS of chromosomes in it too, WTF!
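One possible way to work around the chromosome contamination is to keep only records that declare themselves plasmids. The sketch below assumes the download is a GenBank-format flat file in which each plasmid record carries a /plasmid= qualifier in its source feature and ends with a // line; the function name and file names are hypothetical, and this is not necessarily how the stats above were produced.

```shell
# Sketch: keep only GenBank records that carry a /plasmid= qualifier.
# Assumes each record ends with a line containing only "//".
keep_plasmid_records() {
    # usage: keep_plasmid_records INPUT.gbff > plasmid_only.gbff
    awk '
        { rec = rec $0 "\n" }                 # buffer the current record
        /^ +\/plasmid=/ { is_plasmid = 1 }    # qualifier seen in this record
        /^\/\/$/ {                            # end of record
            if (is_plasmid) printf "%s", rec
            rec = ""; is_plasmid = 0
        }
    ' "$1"
}
```

For example, keep_plasmid_records plasmid.1.genomic.gbff > plasmid_only.gbff (the input name is a placeholder for whichever file was downloaded and decompressed).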
Hi @tseemann, Perhaps this would help as well? A Curated, Comprehensive Database of Plasmid Sequences
ABSTRACT Plasmid sequences are central to a myriad of microbial functions and processes. Here, we have compiled a database of complete plasmid sequences and associated metadata curated from both NCBI’s recent genome database update, which includes plasmids as organisms, and all available annotated bacterial genomes. The resultant database contains 10,892 complete plasmid sequences and associated metadata.
I need a database of non-redundant plasmid-specific proteins and corresponding /gene, /EC_number (and /COG if possible).
Hi @tseemann,
I am trying to reduce the number of hypothetical proteins in my genomes to a minimum. Some genomes are complete (all replicons are closed) while others are not.
1) For closed genomes, I use --proteins with .gbk files of either chromosomes or plasmids, depending on what I am annotating (so each separately).
2) For draft genomes, I sometimes have a few closed replicons that are clearly plasmids (so I use --proteins with the plasmid .gbk), but at other times I do not. How would you advise I proceed?
I also used --prodigaltf (with a training file produced by running prodigal -t on a closed genome).
Could you please weigh in on the approach I am taking? I also appreciate any advice that may help! Thanks and cheers, Kat
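The workflow described in this comment could be sketched roughly as follows. All file names are placeholders, the helper functions are hypothetical, and the DRY_RUN switch is an addition for illustration: when set, the commands are printed instead of executed. Only the prodigal -i/-t and prokka --proteins/--prodigaltf/--outdir options are used.

```shell
# Sketch: train Prodigal on a closed genome, then reuse the training
# file when annotating a draft genome with Prokka.
run() {
    # Print the command when DRY_RUN is set, otherwise execute it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

train_prodigal() {
    # usage: train_prodigal CLOSED_GENOME.fna TRAINING_FILE
    # prodigal -t writes the training file if it does not exist yet.
    run prodigal -i "$1" -t "$2"
}

annotate_draft() {
    # usage: annotate_draft CONTIGS.fasta PROTEINS.gbk TRAINING_FILE OUTDIR
    run prokka --proteins "$2" --prodigaltf "$3" --outdir "$4" "$1"
}
```

Running DRY_RUN=1 annotate_draft draft.fasta plasmids.gbk closed.trn out_dir shows the Prokka command that would be executed without actually starting it.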
That's after I excluded hypotheticals. But it turns out that those stats are all wrong and include lots of chromosomes. Why? https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/ has all the CDS of chromosomes in it too, WTF!
Hi @tseemann, if I remember my classes correctly, plasmids are partly made up of bacterial genes. Could that be it? I just annotated my two natural plasmids using bacterial settings, and it returned a number of ORFs, among them known bacterial genes. What may be missing are replication regions and other regulatory elements, but at least the ORFs are there, right?
For plasmids you will not get a good result if you just use the default settings.
I would recommend getting GenBank files (.gbk or .gb) of all the plasmids that are similar to yours.
Say you get three of them: p1.gbk p2.gbk p3.gbk
Then make a single GenBank file:
cat p1.gbk p2.gbk p3.gbk > plasmids.gbk
Then run prokka with:
--proteins plasmids.gbk
That will give much better names for the proteins in your plasmid.
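Put together, the steps above might look like the sketch below. The function names, the --outdir and --prefix values, and the input file names are placeholders chosen for illustration, not part of the original advice.

```shell
# Sketch: merge several GenBank files and pass the result to Prokka
# through --proteins.
merge_genbank() {
    # usage: merge_genbank OUT.gbk IN1.gbk [IN2.gbk ...]
    out="$1"
    shift
    cat "$@" > "$out"
}

annotate_plasmid() {
    # usage: annotate_plasmid CONTIGS.fasta PROTEINS.gbk OUTDIR
    if ! command -v prokka >/dev/null 2>&1; then
        echo "prokka not found on PATH" >&2
        return 1
    fi
    prokka --proteins "$2" --outdir "$3" --prefix plasmid "$1"
}
```

Typical use would be merge_genbank plasmids.gbk p1.gbk p2.gbk p3.gbk followed by annotate_plasmid my_plasmid.fasta plasmids.gbk prokka_out.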
The next version of Prokka will have a proper plasmid database included.