Twice as many annotations as expected

tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation


Twice as many annotations as expected #339

Closed laurafisch9 closed 6 years ago

laurafisch9 commented 6 years ago

An assembled genome of about 2,000,000 bp was annotated with prokka. If one gene is ~1,000 bp, I would expect it to return about 2,000 genes. Yet my .faa file has ~4,000 genes in it. It looks like the .gff file has a gene call from both prokka and prodigal. Therefore the genes may have been called twice, which would explain why there are twice as many genes in the .gff file. The picture below shows the prodigal and the prokka gene call for the same position in the genome, with the same locus tag as well.

[Image: screen shot 2018-09-20 at 7 36 35 pm]

I am wondering why the .faa file (sample below) has ~4,000 genes when I was only expecting to see 2,000.

[Image: screen shot 2018-09-20 at 7 36 54 pm]

andersgs commented 6 years ago

Hi @laurafisch9, note that in your first table you have two entries per annotation: a gene and a CDS. I assume you are working with a bacterium, in which the gene and the CDS often overlap completely, but a gene may include other regions (3' and 5' UTR, etc.). In more thorough annotations, this might also include mRNA, etc. Check out the description of the GFF3 format: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
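One way to check the "two entries per annotation" point is to tally the feature types in column 3 of the GFF file. The following is a minimal sketch, assuming a tab-separated GFF3 file in which an embedded sequence section, if present, starts at a `##FASTA` directive; the function name `count_feature_types` and the file name in the usage comment are hypothetical.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 feature types (column 3), e.g. 'gene' versus 'CDS'."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Usage (hypothetical file name):
# with open("mygenome.gff") as handle:
#     for feature_type, n in count_feature_types(handle).most_common():
#         print(feature_type, n)
```

If the `gene` and `CDS` counts are roughly equal, the doubled rows in the GFF are paired features rather than duplicated gene calls.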

Can you confirm that there are indeed ~4000 entries in your FAA file? Could you run: grep -c '>' *faa
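For anyone who prefers to do the same count from a script, here is a small sketch of an equivalent check in Python. It counts lines that begin with `>`, which is slightly stricter than the grep above (grep counts any line containing `>`). The function name `count_fasta_records` and the file name are hypothetical.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Usage (hypothetical file name):
# with open("mygenome.faa") as handle:
#     print(count_fasta_records(handle))
```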

What is the bug? Are you positive you have a good assembly, and the data come from a pure culture?

laurafisch9 commented 6 years ago

@andersgs Yes, I have confirmed that there are ~4,000 entries in the FAA file. This assembly came from a metagenome data set and the overall quality of the reads is lower than average. It is not the best assembly. If enough base pairs are ambiguous, could it be that prokka finds more genes to annotate than really exist?

andersgs commented 6 years ago

If it is a metagenomic sample, you may have all sorts of things in there, many of which might have variants. Prokka was never designed for metagenomic samples. How fragmented is the assembly? Have you filtered your assembly to remove, say, contigs smaller than 500 bp? Have you tried running Kraken or some other k-mer ID tool to identify what you may have? You can also load the contigs and GFF into a tool to visualise your annotations (e.g., Geneious). Is there anything funny going on? Finally, what is the mean length of the predicted proteins? Does it match your expectation of ~1000 bp per gene (roughly 330 amino acids)? A highly fragmented assembly can cause issues for Prokka.
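The contig filtering step can be sketched in a few lines of Python. This is only an illustration, assuming a plain nucleotide FASTA file of contigs; the helper names `read_fasta` and `filter_contigs`, the file name, and the 500 bp default (taken from the suggestion above) are not part of Prokka.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=500):
    """Keep only contigs with at least min_length bases."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


# Usage (hypothetical file names):
# with open("contigs.fasta") as src, open("contigs.min500.fasta", "w") as dst:
#     for header, seq in filter_contigs(src):
#         dst.write(f">{header}\n{seq}\n")
```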

laurafisch9 commented 6 years ago

I have not heard of Kraken or Geneious. Thank you for the suggestions; I will look into all of them.

tseemann commented 6 years ago

If your assembly is poor or you have a number of contigs close to the number of expected genes, then most genes will be broken and will appear as two partial genes, making it look like you have twice as many genes.
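A rough way to see whether an assembly is in this regime is to compare the number of contigs with the number of genes expected from its total length. The sketch below assumes a nucleotide FASTA file of contigs and the ~1,000 bp per gene rule of thumb from the original question; the function names and the `expected_gene_bp` parameter are hypothetical, and the output is a crude indicator, not a quality metric Prokka itself reports.

```python
def contig_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)
        elif line and lengths:
            lengths[-1] += len(line)
    return lengths


def n50(lengths):
    """Contig length at which the longest contigs cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fragmentation_summary(lengths, expected_gene_bp=1000):
    """Summarise how fragmented an assembly is relative to expected genes."""
    total = sum(lengths)
    expected_genes = total / expected_gene_bp
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "n50": n50(lengths),
        "contigs_per_expected_gene": (
            len(lengths) / expected_genes if expected_genes else 0.0
        ),
    }


# Usage (hypothetical file name):
# with open("contigs.fasta") as handle:
#     print(fragmentation_summary(contig_lengths(handle)))
```

A `contigs_per_expected_gene` value approaching 1, or an N50 close to the typical gene length, would suggest that many genes are likely to be split across contig boundaries.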

laurafisch9 commented 6 years ago

@andersgs Do you know how to go about finding the mean length of the predicted proteins?
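One possible way to get the mean protein length is to read the .faa file and average the sequence lengths. This is a minimal sketch, assuming a standard protein FASTA file; the function names and the file name are hypothetical. Lengths are in amino acids, and the `mean_bp` value is only an approximation (three bases per residue, stop codon not counted).

```python
from statistics import mean, median


def protein_lengths(faa_lines):
    """Return the length in amino acids of each protein in FASTA lines."""
    lengths = []
    for line in faa_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)
        elif line and lengths:
            # Ignore a trailing '*' if the file marks stop codons that way.
            lengths[-1] += len(line.rstrip("*"))
    return lengths


def summarize(lengths):
    """Summarise protein lengths; returns None if there are no proteins."""
    if not lengths:
        return None
    avg = mean(lengths)
    return {
        "count": len(lengths),
        "mean_aa": avg,
        "median_aa": median(lengths),
        "mean_bp": avg * 3,  # approximate coding length, excluding the stop
    }


# Usage (hypothetical file name):
# with open("mygenome.faa") as handle:
#     print(summarize(protein_lengths(handle)))
```

A mean far below the ~330 amino acids implied by ~1,000 bp genes would be consistent with many short or fragmented gene calls.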