oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Inquiry about the coding density equation #281

Closed AhmedElsherbini closed 5 months ago

AhmedElsherbini commented 7 months ago

Hi Oliver,

I hope you are doing well.

This is just an inquiry about the equation used for coding density.

The background of the question: I am comparing two sister species (the first is a successful commensal and the second is a rare pathogen). The coding density stats from Bakta show a significant difference of ~2 % between them. I have only a few long-read assemblies (<10 genomes), compared to many short-read assemblies (>80 genomes), for both of them. The thing is that, for both species, using long-read sequencing, I see no significant difference in genome length, and the same goes for CDS. My only hint is that the species with fewer coding sequences is highly enriched in mobilome (IS elements), which could be something that affects this calculation.

Note that the rare species, which has the lower coding density, has on average 20 pseudogenes, compared to 8 in the first species.

As for unique gene extraction, using the default parameters in Panaroo and PPanGGOLiN, I can see ~113 proteins unique to the first species and 80 unique to the second, which is hard for me to relate to the 2 % difference in coding density.

Thank you in advance.

Best, Ahmed

oschwengers commented 7 months ago

Hi Ahmed, of course, short- vs. long-read sequencing can indeed have a strong impact on assembly lengths. Short-read draft assemblies in particular can suffer from too many contig edges, which makes it very difficult to detect all CDS close to or even spanning contig edges. The coding density is simply the sum of all bases that are part of an annotated genome feature, divided by the genome length. I hope this helps. Otherwise, can you elaborate a bit more?
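To make that definition concrete, here is a minimal Python sketch. It is not Bakta's actual code: the function names, the `(contig, start, end)` tuple layout, and the choice to count a base only once when features overlap are all assumptions made for illustration.

```python
from typing import Iterable

# Hypothetical feature layout: (contig id, start, end), 1-based inclusive coordinates.
Feature = tuple[str, int, int]


def covered_bases(features: Iterable[Feature]) -> int:
    """Count bases covered by at least one feature, merging overlaps per contig."""
    by_contig: dict[str, list[tuple[int, int]]] = {}
    for contig, start, end in features:
        by_contig.setdefault(contig, []).append((start, end))

    total = 0
    for intervals in by_contig.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the current merged interval: extend it (assumption: count once).
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


def coding_density(features: Iterable[Feature], contig_lengths: dict[str, int]) -> float:
    """Annotated bases divided by the total length of all contigs."""
    return covered_bases(features) / sum(contig_lengths.values())


# Toy example with made-up coordinates: two overlapping features on c1, one on c2.
example_features = [("c1", 1, 300), ("c1", 250, 400), ("c2", 10, 109)]
example_lengths = {"c1": 1000, "c2": 1000}
print(coding_density(example_features, example_lengths))  # 500 / 2000 = 0.25
```

Because the denominator is the total assembly length, anything that shrinks the annotated fraction (for example features missed at contig edges) lowers the value even when the genome length stays the same.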

AhmedElsherbini commented 7 months ago

Thank you Oliver for your response.

Summing up all the bases of the annotated genome is (~the sum of the whole length of CDS) / the sum of the total contig length, right? If we focus on the long-read assembled genomes, where there is NO statistical difference between the two species in CDS number or total length, my guess now for the difference in coding density is either longer CDS in the species with the higher density, or pseudogenes/transposons (they are short in length anyway) in the less dense species. Does that make sense now?

Best, Ahmed

oschwengers commented 6 months ago

Hi, the coding density is not limited to CDS but comprises all genomic features, e.g. non-coding RNA genes, regulatory elements, DNA motifs, etc.
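A small Python sketch of how one could check which feature types contribute to the value, for example to see whether a difference comes from CDS or from other features. The feature type labels, the tuple layout, and the function names are hypothetical, and overlaps between features are simply not merged here.

```python
from collections import defaultdict


def bases_by_type(features):
    """Sum feature lengths per feature type (overlapping features are counted separately)."""
    totals = defaultdict(int)
    for ftype, start, end in features:
        totals[ftype] += end - start + 1  # 1-based inclusive coordinates
    return dict(totals)


def density_by_type(features, genome_length):
    """Fraction of the genome length covered by each feature type."""
    return {ftype: n / genome_length for ftype, n in bases_by_type(features).items()}


# Toy example with made-up coordinates on a 2000 bp genome.
example = [
    ("cds", 1, 900),
    ("cds", 1001, 1600),
    ("tRNA", 1700, 1775),
    ("ncRNA", 1800, 1899),
]
table = density_by_type(example, 2000)
print(table["cds"])         # CDS-only share: 1500 / 2000 = 0.75
print(sum(table.values()))  # share over all feature types, here about 0.838
```

Comparing the CDS-only share with the all-features share per genome shows how much of a density difference is attributable to non-CDS features.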

Regarding your question, I guess in theory yes, that could be, but I'd be rather reluctant to use this kind of statistic. I'd rather directly compare the presence/absence of certain genes, etc.

AhmedElsherbini commented 6 months ago

Absolutely, you are right.

I followed this up with tools like Panaroo and PPanGGOLiN for gene presence/absence comparison. The mobilome (insertion elements) is the main thing enriched in the sister species with the lower coding density.

I just wanted to investigate the cause of this coding density difference, as I get a lot of questions regarding this 2 % difference.