Difference between clusters and smCOGs from antismash?

lukaskon commented 10 months ago

Are you using the latest release? 1.8.15

Describe the bug

Can you please explain the difference between "clusters" and "smCOGs" parsed by antismash? Are "clusters" actually protoclusters? I could not find any details in the documentation, but maybe I missed it.

All the numbers I found in the log file are significantly higher than what was expected based on my organism's reference genome (Botrytis cinerea has approx. 45 key genes encoding biosynthetic enzymes for SM synthesis), and I am wondering if this is because I am interpreting them wrong, database updates since 2012, differences in repeat masking, etc. Any explanation or advice is appreciated!

What command did you issue?

Antismash direct results

When I looked at the number of .gbk files output from antismash, this is closer to what I would have expected. According to antismash, "A region in antiSMASH 5 and above corresponds to the gene cluster annotation in antiSMASH 4 and earlier.", so I would have thought this number would be the cluster number?

(funannotate) W18_smash$ find . -name '*.region???.gbk' | wc -l
44

Logfiles funannotate-annotate.log

Now parsing antiSMASH v6 results, finding SM clusters
[06/15/23 18:33:38]: Found 54 clusters, 127 biosynthetic enyzmes, and 149 smCOGs predicted by antiSMASH

OS/Install Information

funannotate versions.txt

hyphaltip commented 10 months ago

Hi - this would really be an antismash question; this tool is just parsing the results. I would encourage you to look at the html report that is in the antismash result folder you would have provided to funannotate.

I believe the answer is that, according to your report, there are 54 SM clusters instead of the 45 found perhaps in a previous version of the genome. I know the Botrytis genome has been improved too, but it may also be that some of the 54 are not among the SM types that they decided to report; antismash has also been updated a lot in v6 compared with v4.

smCOGs are a particular type of gene that is conserved and found in SM; there may be more than one of these genes in a given cluster (e.g., a cluster might have an NRPS and a P450, and presumably both of these can be smCOGs).
From the antismash paper, "Secondary Metabolite specific Clusters of Orthologous Groups (smCoGs)": https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125804/

Finally, from the genes within this database of gene clusters, we constructed secondary metabolism Clusters of Orthologous Groups (smCOGs). These are used in yet another module to predict and categorize the functions of accessory genes, and to calculate phylogenetic trees for each gene with a seed alignment of its smCOG protein family. Our benchmark results show that our method reliably detects gene clusters of a wide variety of biosynthetic types, and that it is able to significantly enhance manual genome annotations of secondary metabolite biosynthesis.

@nextgenusfs will perhaps have additional answers.

lukaskon commented 10 months ago

Thank you for the quick response. I looked at the html from the antismash v6 results, and it also reports 44 regions (aka clusters, per their definition). I am confused about why this is different from what funannotate reports (54); is funannotate using different criteria to define an SM cluster? Thanks again.

nextgenusfs commented 10 months ago

I believe it is indeed parsing the protoclusters, which was what the clusters were called in v4. But now I think this is outdated with the "regions" idea that versions v5 and greater use. I've recently run some genomes through v7 and my personal feeling is that antiSMASH is being overly conservative with what it calls a "cluster region" and there are some new fungal-RIPP-like models (not sure if these are very accurate at all...). But the antiSMASH parsing code should likely be fixed in v6 and v7 to identify "regions" instead of protoclusters. I don't know when I'll have time to look at this.....
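
To check which feature type a count comes from, one can tally the feature keys in an antiSMASH GenBank output file. The following is a minimal sketch, not funannotate's parsing code. It assumes that antiSMASH v5 and later write `region` and `protocluster` features into the GenBank feature table; the helper `count_feature_keys` and the toy record are made up for illustration.

```python
import re
from collections import Counter

# A GenBank feature line starts with exactly five spaces, then the feature key,
# then whitespace and the location. Qualifier lines are indented by 21 spaces,
# so they do not match this pattern.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+\S")


def count_feature_keys(genbank_text):
    """Count feature keys (e.g. 'region', 'protocluster', 'CDS') in GenBank text."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith(("ORIGIN", "//")):
            in_features = False
        elif in_features:
            match = FEATURE_LINE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Hypothetical toy record: one region that spans two overlapping protoclusters.
TOY_RECORD = """\
LOCUS       toy_contig             60000 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     region          1..45000
                     /product="NRPS"
                     /product="T1PKS"
     protocluster    1..30000
                     /product="NRPS"
     protocluster    20000..45000
                     /product="T1PKS"
     CDS             5000..9000
                     /locus_tag="toy_0001"
ORIGIN
//
"""

if __name__ == "__main__":
    counts = count_feature_keys(TOY_RECORD)
    print("regions:", counts["region"])
    print("protoclusters:", counts["protocluster"])
```

For the toy record this reports one region and two protoclusters, which shows how a parser that counts protoclusters can report a higher number than the number of regions. Running the same function over the real antiSMASH GenBank output would show whether the 54 versus 44 difference has this origin.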

lukaskon commented 10 months ago

Great, I think we are on the same page. I also noticed the same thing when trying antismash v7. If this is of any use, here is the range reported for clusters identified in the same 7 isolates of Botrytis cinerea for each method.

Funannotate-annotate log (using antismash v6 input): 52-61
Command line antismash v6 (see command used above): 42-52
Browser antismash v7: 20-24 (including the B05.10 reference genome)

nextgenusfs / funannotate

Difference between clusters and smCOGs from antismash? #955