Metagenomics (many samples): Prokka as input to GhostKOALA

tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation


Metagenomics (many samples): Prokka as input to GhostKOALA #228

Open willnotburn opened 7 years ago

willnotburn commented 7 years ago

I want to annotate a metagenomic project, sampled from many locations, with KOs (KEGG Orthology identifiers). The workflow I have in mind is: assembled contigs -> Prokka -> proteins WITH reference to original contigs -> GhostKOALA -> KOs

The goal is a table of KOs in rows, samples in columns, with KO abundances in each sample populating the table.

For annotation with KOs, GhostKOALA takes amino acid sequences, presumably with protein headers. Prokka outputs translated CDS with headers in the .faa file. Perfect! But to get abundances of KOs in samples, I need to know where the proteins come from, i.e., which contig(s). The contig abundances in samples are calculated via mapping in a separate step...
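To make the target table concrete, here is a minimal Python sketch of the final aggregation step. It is only one possible reading of "KO abundance": it assumes the abundance of a KO in a sample is the sum of the abundances of the contigs that carry a protein assigned to that KO. The function name `ko_by_sample`, the three input dictionaries, and all IDs and numbers below are hypothetical and made up for illustration. How the protein-to-contig map, the KO assignments, and the contig abundances are produced is outside this sketch.

```python
from collections import defaultdict


def ko_by_sample(protein_to_contig, protein_to_ko, contig_abundance):
    """Build {KO: {sample: abundance}} by summing contig abundances per KO.

    Assumption: each protein contributes the abundance of its contig once.
    Proteins without a KO or without a known contig are skipped.
    """
    table = defaultdict(lambda: defaultdict(float))
    for protein, ko in protein_to_ko.items():
        contig = protein_to_contig.get(protein)
        if contig is None:
            continue  # protein not found in the annotation
        for sample, value in contig_abundance.get(contig, {}).items():
            table[ko][sample] += value
    # Convert nested defaultdicts to plain dicts for easier printing/export
    return {ko: dict(samples) for ko, samples in table.items()}


# Made-up toy data: three proteins on two contigs, two sampling sites
protein_to_contig = {"p1": "c1", "p2": "c1", "p3": "c2"}
protein_to_ko = {"p1": "K00001", "p2": "K00002", "p3": "K00001"}
contig_abundance = {
    "c1": {"siteA": 10.0, "siteB": 2.0},
    "c2": {"siteA": 1.0, "siteB": 4.0},
}

for ko, samples in sorted(ko_by_sample(
        protein_to_contig, protein_to_ko, contig_abundance).items()):
    print(ko, samples)
```

Other definitions (for example, weighting by gene length or using per-gene read counts instead of per-contig coverage) would change the inner loop but not the overall shape of the table.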

Does Prokka output info that connects the proteins in the .faa file with the original input contigs?

tseemann commented 7 years ago

I think all the information you need is in the output GBK and GFF files?

For each CDS feature (column 3) in the .gff file, the contig name is in column 1 and the feature's identifier is the ID=xxxxx attribute in column 9, so you can extract both from the same line.
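The column lookup described above can be sketched in a few lines of Python. This is a minimal sketch, not part of Prokka: the function name `cds_to_contig` is hypothetical, and the sample GFF text (contig names, tool versions, IDs) is made up for illustration. It assumes tab-separated GFF3 with nine columns and stops at the `##FASTA` marker, after which the file holds sequences rather than features. Whether the `ID` values match the headers in your .faa file is worth checking on your own output.

```python
import io


def cds_to_contig(gff_lines):
    """Map each CDS ID (column 9, ID=...) to its contig name (column 1)."""
    mapping = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # keep only well-formed CDS features
        # Column 9 is a ';'-separated list of key=value attributes
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            mapping[attrs["ID"]] = cols[0]
    return mapping


# Made-up example input in GFF3 layout
example = """##gff-version 3
##sequence-region contig_1 1 5000
contig_1\tProdigal:2.6\tCDS\t100\t400\t.\t+\t0\tID=PROKKA_00001;product=hypothetical protein
contig_1\tAragorn:1.2\ttRNA\t900\t975\t.\t-\t.\tID=PROKKA_00002;product=tRNA-Ala
contig_2\tProdigal:2.6\tCDS\t10\t310\t.\t-\t0\tID=PROKKA_00003;product=hypothetical protein
##FASTA
>contig_1
ACGT
"""

print(cds_to_contig(io.StringIO(example)))
# {'PROKKA_00001': 'contig_1', 'PROKKA_00003': 'contig_2'}
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper, e.g. `with open("sample.gff") as fh: mapping = cds_to_contig(fh)` (the filename is a placeholder).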

If you have access to the KEGG ortholog database (I assume it is not free/open anymore?) then you can create a custom Prokka database and provide it via --proteins and even annotate the KO directly.