tseemann / nullarbor

:floppy_disk: :page_with_curl: "Reads to report" for public health and clinical microbiology
GNU General Public License v2.0

different CDS numbers #228

Closed skaralegui closed 4 years ago

skaralegui commented 5 years ago

Hi, I am comparing 8 S. marcescens strains and ran Nullarbor on them. The CDS column in my assembly table report does not match the numbers shown to the right of the pangenome graphic. Do both show the number of CDS in each strain, or should I understand them as different things? If they are different, what do the numbers in the pangenome graphic mean?

Thank you in advance!

Attachment: pan report.pdf

tseemann commented 5 years ago

The plot is a direct picture of what is in the roary gene_presence_absence.csv file. It shows the number of ortholog clusters each isolate has members in. Gene duplications (paralogs) often count as only 1 ortholog cluster, so the number on the right will always be LESS than the number of CDS in the sample. The number at the bottom is the total number of ortholog clusters in the roary output, i.e. the pan genome. Depending on roary settings, this may also exclude small proteins.
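
For anyone who wants to check the plotted numbers by hand, here is a minimal Python sketch that counts, for each isolate, how many ortholog clusters it has a member in, plus the total number of clusters. It is not part of Nullarbor or roary. It assumes that gene_presence_absence.csv starts with 14 metadata columns followed by one column per isolate, and that an empty cell means the isolate has no member in that cluster; check your own file's header before relying on it. The function name and the toy table are made up for illustration.

```python
import csv
import io

# Assumption: the first 14 columns of gene_presence_absence.csv are
# metadata (Gene, Annotation, ...), and every later column is one isolate.
N_METADATA_COLUMNS = 14


def clusters_per_isolate(handle, n_metadata=N_METADATA_COLUMNS):
    """Return ({isolate: clusters with a member}, total clusters)."""
    reader = csv.reader(handle)
    header = next(reader)
    isolates = header[n_metadata:]
    counts = {name: 0 for name in isolates}
    total = 0
    for row in reader:
        total += 1
        for name, cell in zip(isolates, row[n_metadata:]):
            # A non-empty cell means the isolate is present in this cluster,
            # no matter how many gene IDs the cell lists.
            if cell.strip():
                counts[name] += 1
    return counts, total


# Toy table with only 2 metadata columns, to keep the example short.
toy = io.StringIO(
    "Gene,Annotation,A,B\n"
    "g1,x,a1,b1\n"
    "g2,y,a2\ta3,\n"   # isolate A has two paralogs in one cluster
    "g3,z,,b2\n"
)
toy_counts, toy_total = clusters_per_isolate(toy, n_metadata=2)
print(toy_counts, toy_total)
```

On a real run you would open the file with `open("gene_presence_absence.csv", newline="")` and pass the handle to `clusters_per_isolate`, leaving `n_metadata` at its default if the 14-column assumption holds for your roary version.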

skaralegui commented 5 years ago

Thank you for your response. I understand now and the explanation makes sense. Nevertheless, I found a problem: the number of CDS in the assembly report (attached to my previous message) is smaller than the number of ortholog clusters shown in the pangenome graphic, which is the opposite of what you said...

tseemann commented 5 years ago

The number of CDS in Prokka is different to the number of orthologs because small CDS are not included in roary, and also roary collapses paralogs (duplicated genes) into a single cluster.
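
To see how much of the gap comes from collapsed paralogs, a rough Python sketch like the one below can count both the clusters an isolate appears in and the individual gene IDs listed for it. It rests on assumptions that should be verified against your own file: 14 leading metadata columns in gene_presence_absence.csv, and multiple gene IDs from the same isolate stored in one cell separated by tabs. The function name and toy data are hypothetical. It cannot show CDS that roary dropped before clustering, since those never appear in the file.

```python
import csv
import io

# Assumptions: 14 metadata columns come first, and paralogs from the same
# isolate share one cell, separated by a tab character.
N_METADATA_COLUMNS = 14


def genes_and_clusters(handle, isolate, n_metadata=N_METADATA_COLUMNS, sep="\t"):
    """Return (gene IDs listed, clusters occupied) for one isolate column."""
    reader = csv.reader(handle)
    header = next(reader)
    # Search only among the isolate columns; raises ValueError if absent.
    col = header.index(isolate, n_metadata)
    genes = 0
    clusters = 0
    for row in reader:
        cell = row[col].strip() if col < len(row) else ""
        if not cell:
            continue
        clusters += 1  # one cluster, however many paralogs it holds
        genes += len([g for g in cell.split(sep) if g])
    return genes, clusters


# Toy table with 2 metadata columns; isolate A has two paralogs in g2.
toy_csv = (
    "Gene,Annotation,A,B\n"
    "g1,x,a1,b1\n"
    "g2,y,a2\ta3,\n"
    "g3,z,,b2\n"
)
genes_a, clusters_a = genes_and_clusters(io.StringIO(toy_csv), "A", n_metadata=2)
genes_b, clusters_b = genes_and_clusters(io.StringIO(toy_csv), "B", n_metadata=2)
print(genes_a, clusters_a, genes_b, clusters_b)
```

If the gene count for an isolate is still well below its Prokka CDS count, the remainder is presumably CDS that never entered the clustering, which matches the point about small CDS above.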