nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Difference between clusters and smCOGs from antismash? #955

Open lukaskon opened 10 months ago

lukaskon commented 10 months ago

Are you using the latest release? 1.8.15

Describe the bug Can you please explain the difference between "clusters" and "smCOGs" parsed by antismash? Are "clusters" actually protoclusters? I could not find any details in the documentation, but maybe I missed it.

All the numbers I found in the log file are significantly higher than what was expected based on my organism's reference genome (Botrytis cinerea has approx. 45 key genes encoding biosynthetic enzymes for SM synthesis), and I am wondering if this is because I am interpreting them wrong, or because of database updates since 2012, differences in repeat masking, etc. Any explanation or advice is appreciated!

What command did you issue?

Antismash direct results

When I looked at the number of .gbk files output from antismash, this is closer to what I would have expected. According to antismash, "A region in antiSMASH 5 and above corresponds to the gene cluster annotation in antiSMASH 4 and earlier.", so I would have thought this number would be the cluster number?

(funannotate) W18_smash$ find . -name '*.region???.gbk' | wc -l
44

Logfiles funannotate-annotate.log

Now parsing antiSMASH v6 results, finding SM clusters
[06/15/23 18:33:38]: Found 54 clusters, 127 biosynthetic enyzmes, and 149 smCOGs predicted by antiSMASH

OS/Install Information

funannotate versions.txt

hyphaltip commented 10 months ago

Hi, this would really be an antiSMASH question; this tool is just parsing the results. I would encourage you to look at the HTML report in the antiSMASH result folder you provided to funannotate.

I believe the answer is that, per your report, there are 54 SM clusters instead of the 45 perhaps found in a previous version of the genome. I know the Botrytis genome has been improved too, but it may also be that some of the 54 are not among the SM types that they decided to report. antiSMASH has also changed a lot in v6 compared with v4.

lukaskon commented 10 months ago

Thank you for the quick response. I looked at the html from the antismash v6 results, and it also reports 44 regions (aka clusters, per their definition). I am confused why this is different from what funannotate reports (54); Is funannotate using different criteria to define an SM cluster? Thanks again.

nextgenusfs commented 10 months ago

I believe it is indeed parsing the protoclusters, which was what the clusters were called in v4. But now I think this is outdated given the "regions" idea that v5 and later use. I've recently run some genomes through v7, and my personal feeling is that antiSMASH is being overly conservative with what it calls a "cluster region", and there are some new fungal-RiPP-like models (not sure if these are very accurate at all...). But the antiSMASH parsing code should likely be fixed for v6 and v7 to identify "regions" instead of protoclusters. I don't know when I'll have time to look at this.

lukaskon commented 10 months ago

Great, I think we are on the same page. I also noticed the same thing when trying antismash v7. If this is of any use, here is the range reported for clusters identified in the same 7 isolates of Botrytis cinerea for each method.

Funannotate-annotate log (using antismash v6 input): 52-61
Command line antismash v6 (see command used above): 42-52
Browser antismash v7: 20-24 (including the B05.10 reference genome)
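For anyone wanting to check which feature type a given count corresponds to, here is a minimal sketch that tallies feature keys in an antiSMASH GenBank file using plain text parsing. It assumes the v5+ output uses `region`, `protocluster`, and `cand_cluster` as feature keys; the function name `count_features` and the sample record are made up for illustration, and this is not funannotate's actual parsing code.

```python
import re
from collections import Counter

# A GenBank feature line starts with exactly 5 spaces followed by the feature key.
# Qualifier and continuation lines are indented by 21 spaces, so they do not match.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+\S")


def count_features(genbank_text, keys=("region", "protocluster", "cand_cluster")):
    """Count occurrences of the given feature keys in GenBank-formatted text."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
            continue
        if in_features:
            match = FEATURE_LINE.match(line)
            if match and match.group(1) in keys:
                counts[match.group(1)] += 1
    return counts


# Hypothetical, heavily trimmed record: two overlapping protoclusters in one region.
SAMPLE = """\
LOCUS       scaffold_1   1000 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     protocluster    100..400
                     /product="T1PKS"
     protocluster    300..700
                     /product="NRPS"
     region          100..700
                     /region_number="1"
     CDS             150..350
ORIGIN
//
"""

counts = count_features(SAMPLE)
print(dict(counts))
```

In the sample, two overlapping protoclusters fall inside a single region. If the parser counts protoclusters, as suspected above, this kind of overlap would produce a higher number than the region count. To try it on real output, read the full-genome `.gbk` from the antiSMASH result folder and pass its text to `count_features`.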