tseemann / prokka

Rapid prokaryotic genome annotation

Self-made Genus Database is invalid #372

Closed YiweiZhu closed 5 years ago

YiweiZhu commented 5 years ago

I built a Pseudomonas database following the commands on the website. After cd-hit completes, I don't get a .bak.clstr file. And after makeblastdb completes, I only get three .p files: .phr, .pin and .psq. The --listdb option doesn't show my database, even if I move these files into the db folder. Which step is wrong?

$ prokka-genbank_to_fasta_db Pseudomonas_aeruginosa_12-4-4_59__3618.gbk > Pseudomonas.faa
Will use first of (protein_id locus_tag db_xref) as FASTA ID
Parsing: NZ_CP013696
Done.

$ cd-hit -i Pseudomonas.faa -o Pseudomonas -T 0 -M 0 -g 1 -s 0.8 -c 0.9
Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i Pseudomonas.faa -o Pseudomonas -T 0 -M 0 -g 1 -s 0.8 -c 0.9
Started: Tue Mar 19 15:41:19 2019
Output
Option -T is ignored: multi-threading with OpenMP is NOT enabled!
total seq: 3939
longest and shortest : 4991 and 23
Total letters: 1443332
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 1M
Buffer : 1 X 11M = 11M
Table : 1 X 65M = 65M
Miscellaneous : 0M
Total : 79M
Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 265995500
comparing sequences from 0 to 3939
... 3939 finished 3917 clusters
Approximated maximum memory consumption: 90M
writing new database
writing clustering information
program completed !
Total CPU time 0.88

$ makeblastdb -dbtype prot -in Pseudomonas
Building a new DB, current time: 03/19/2019 15:42:31
New DB name: /data3/zyw/gbk/Pseudomonas
New DB title: Pseudomonas
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3917 sequences in 0.204363 seconds.

$ prokka --listdb
[15:49:25] Looking for databases in: /data3/zyw/miniconda3/bin/../db
[15:49:25] Kingdoms: Archaea Bacteria Mitochondria Viruses
[15:49:25] Genera: Enterococcus Escherichia Staphylococcus
[15:49:25] HMMs: HAMAP
[15:49:25] CMs: Bacteria Viruses

ealdraed commented 5 years ago

Hello @YiweiZhu !

Please make sure your files (Pseudomonas.phr, Pseudomonas.pin and Pseudomonas.psq) reside under /data3/zyw/miniconda3/db/. Your DB seems to sit under /data3/zyw/gbk/. The clustering step in the README may not be required with a single genome. It is meant for joining several genomes of the same genus while avoiding redundancy from homologous proteins. If you apply CD-HIT to a single genome, you probably only filter out paralogs (homologs in a wider sense that evolved through gene duplication, whereas orthologs evolved through speciation).
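For reference, moving the index files could be sketched as below. This is only a sketch under assumptions: `install_genus_db` is a hypothetical helper name, not part of Prokka, and it targets the `genus/` subfolder of the db directory because that is where the Prokka README moves custom genus databases. Adjust the target if your installation is laid out differently.

```shell
# Hypothetical helper (not part of Prokka): copy the three BLAST protein
# index files of a genus database into Prokka's db directory.
# Usage: install_genus_db <blast_db_prefix> <prokka_db_dir>
install_genus_db() {
    prefix=$1                # e.g. /data3/zyw/gbk/Pseudomonas
    db_dir=$2                # e.g. /data3/zyw/miniconda3/db
    target="$db_dir/genus"   # assumption: genus DBs live in genus/, as in the Prokka README

    mkdir -p "$target" || return 1
    for ext in phr pin psq; do
        # Stop early if makeblastdb did not produce one of the expected files.
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing $prefix.$ext" >&2
            return 1
        fi
        cp "$prefix.$ext" "$target/" || return 1
    done
}

# Example with the paths from this thread (uncomment to run):
# install_genus_db /data3/zyw/gbk/Pseudomonas /data3/zyw/miniconda3/db
# prokka --listdb    # Pseudomonas should then appear under "Genera:"
```

Once the genus shows up in `prokka --listdb`, it is selected at annotation time with `--usegenus --genus Pseudomonas`.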

tseemann commented 5 years ago

@ealdraed why is bioconda putting db into the root folder of miniconda?

tseemann commented 5 years ago

@YiweiZhu the simplest thing to do is to just use

prokka --proteins Pseudomonas_aeruginosa_12-4-4_59__3618.gbk ...