nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

AntiSMASH data were not properly processed #736

Open jae0326 opened 2 years ago

jae0326 commented 2 years ago

Are you using the latest release? version: 1.8.9

Describe the bug AntiSMASH data do not seem to be incorporated properly by the annotate function. Although no error was reported during the analysis, only nine clusters appear in 'annotations.antismash.clusters.txt', as shown below. Each cluster contains more than a hundred genes.

CNY61_000480-T1 note antiSMASH:Cluster_1
CNY61_000481-T1 note antiSMASH:Cluster_1
CNY61_000482-T1 note antiSMASH:Cluster_1
CNY61_000483-T1 note antiSMASH:Cluster_1
CNY61_000484-T1 note antiSMASH:Cluster_1
CNY61_002216-T1 note antiSMASH:Cluster_1
CNY61_002217-T1 note antiSMASH:Cluster_1
CNY61_002218-T1 note antiSMASH:Cluster_1
CNY61_002219-T1 note antiSMASH:Cluster_1
CNY61_002220-T1 note antiSMASH:Cluster_1
CNY61_002222-T1 note antiSMASH:Cluster_1
CNY61_002223-T1 note antiSMASH:Cluster_1
CNY61_002224-T1 note antiSMASH:Cluster_1
CNY61_002225-T1 note antiSMASH:Cluster_1
CNY61_005335-T1 note antiSMASH:Cluster_1
CNY61_005336-T1 note antiSMASH:Cluster_1
CNY61_005337-T1 note antiSMASH:Cluster_1
CNY61_005338-T1 note antiSMASH:Cluster_1
CNY61_005340-T1 note antiSMASH:Cluster_1
CNY61_005341-T1 note antiSMASH:Cluster_1
CNY61_005342-T1 note antiSMASH:Cluster_1
CNY61_005343-T1 note antiSMASH:Cluster_1
CNY61_005344-T1 note antiSMASH:Cluster_1
CNY61_005345-T1 note antiSMASH:Cluster_1
CNY61_005346-T1 note antiSMASH:Cluster_1

....... (more than a hundred lines)

CNY61_013295-T1 note antiSMASH:Cluster_1
CNY61_013296-T1 note antiSMASH:Cluster_1
CNY61_013297-T1 note antiSMASH:Cluster_1
CNY61_013298-T1 note antiSMASH:Cluster_1
CNY61_013299-T1 note antiSMASH:Cluster_1
CNY61_000691-T1 note antiSMASH:Cluster_2
CNY61_000692-T1 note antiSMASH:Cluster_2
CNY61_000693-T1 note antiSMASH:Cluster_2
CNY61_000694-T1 note antiSMASH:Cluster_2
CNY61_000695-T1 note antiSMASH:Cluster_2
CNY61_000696-T1 note antiSMASH:Cluster_2
CNY61_000697-T1 note antiSMASH:Cluster_2
CNY61_000698-T1 note antiSMASH:Cluster_2
....

The AntiSMASH analysis was run manually on the web server (https://fungismash.secondarymetabolites.org/#!/start). The .gbk file was downloaded and passed to the annotate function.

What command did you issue?

funannotate annotate -i fun_pred --cpus 32 --sbt template_js0361.sbt --phobius ./fun_pred/annotate_misc/phobius_results.txt --antismash ./fun_pred/annotate_misc/Colletotrichum_nymphaeae_JS-0361.gbk --iprscan ./fun_pred/annotate_misc/iprscan_result.xml -s "Colletotrichum nymphaeae" --isolate JS-0361

Logfiles Please provide relevant log files of the error.

[05/15/22 07:01:23]: Now parsing antiSMASH v6 results, finding SM clusters
[05/15/22 07:01:27]: Found 68 clusters, 154 biosynthetic enyzmes, and 242 smCOGs predicted by antiSMASH
[05/15/22 07:01:34]: Found 0 duplicated annotations, adding 107,113 valid annotations
[05/15/22 07:01:35]: Parsing tbl file: /home/linu/workspace/genomes/Colletotrichum/JS-0361/GenomeAssembly/fun_pred/annotate_misc/genome.tbl
[05/15/22 07:01:36]: Converting to final Genbank format, good luck!
[05/15/22 07:01:36]: /home/linu/anaconda3/envs/funannotate/bin/python /home/linu/anaconda3/envs/funannotate/lib/python3.7/site-packages/funannotate/aux_scripts/tbl2asn_parallel.py -i fun_pred/annotate_misc/tbl2asn/genome.tbl -f fun_pred/annotate_misc/tbl2asn/genome.fsa -o fun_pred/annotate_misc/tbl2asn --sbt template_js0361.sbt -d discrepency.report.txt -s Colletotrichum nymphaeae -t -l paired-ends -v 1 -c 32 --isolate JS-0361
[05/15/22 07:02:55]: Creating AGP file and corresponding contigs file
[05/15/22 07:02:57]: Cross referencing SM cluster hits with MIBiG database version 1.4
[05/15/22 07:02:57]: diamond blastp --sensitive --query fun_pred/annotate_misc/antismash/smcluster.proteins.fasta --threads 32 --out fun_pred/annotate_misc/antismash/smcluster.MIBiG.blast.txt --db /home/linu/workspace/tools/funannotate_db/mibig.dmnd --max-hsps 1 --evalue 0.001 --max-target-seqs 1 --outfmt 6
[05/15/22 07:03:04]: Creating tab-delimited SM cluster output
[05/15/22 07:03:09]: Writing genome annotation table.
[05/15/22 07:04:35]: Funannotate annotate has completed successfully!

OS/Install Information

Ubuntu 18.04.6 LTS /

Checking dependencies for 1.8.9
You are running Python v 3.7.10. Now checking python packages...
biopython: 1.77
goatools: 1.1.12
matplotlib: 3.4.3
natsort: 8.1.0
numpy: 1.21.5
pandas: 1.3.5
psutil: 5.9.0
requests: 2.27.1
scikit-learn: 1.0.2
scipy: 1.7.3
seaborn: 0.11.2
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Bio::Perl: 1.7.4
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/linu/workspace/tools/funannotate_db
$PASAHOME=/home/linu/anaconda3/envs/funannotate/opt/pasa-2.4.1
$TRINITY_HOME=/home/linu/anaconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/linu/anaconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/linu/anaconda3/envs/funannotate/config/
$GENEMARK_PATH=/home/linu/workspace/tools/genemark/gmes_linux_64
All 6 environmental variables are set

Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.14
emapper.py: 2.1.7
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 11.0.13
kallisto: 0.46.1
mafft: v7.490 (2021/Oct/30)
makeblastdb: makeblastdb 2.11.0+
minimap2: 2.24-r1122
proteinortho: 6.0.33
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.12
signalp: 4.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.9 (July 2021)
tantan: tantan 26
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.11.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: gmes_petap.pl not installed

nextgenusfs commented 2 years ago

Okay, it sounds like they have changed the cluster tags in the GenBank format for antiSMASH v6. I'll need to see what they are now; I think it is currently using protocluster tags to define a cluster.

IanDMedeiros commented 2 years ago

I am seeing the same discrepancy. Just to make sure I understand the problem, the issue is only that multiple clusters are all getting called Cluster_1 (or Cluster_2 or Cluster_3), not that there is any other problem with the annotations?

ernesfranco commented 2 years ago

Hi! Thank you for developing this great tool!

I am having an issue that I think is related to this post:

Funannotate compare creates a "secmet" folder containing a table and a graph showing counts for "Other: other backbone enzyme" only.

Thank you!

Funannotate version: 1.8.12 (conda)
AntiSMASH version: 6.1.1

$ funannotate mask -i input -o assembly.fa
$ funannotate predict -i assembly.fa -o fun --species species --strain strain --busco_db eurotiomycetes
$ funannotate iprscan -i fun -m docker
$ funannotate remote -i fun -m phobius antismash -e email
$ funannotate annotate -i fun --busco_db eurotiomycetes
$ funannotate compare -i gbk_files_from_funannotate_annotate

Screenshot from 2022-08-30 15-09-41

IanDMedeiros commented 2 years ago

The issue in the original post and in my reply (I'm not sure about @ernesfranco's) seems to be that antiSMASH numbers clusters at the contig level. The online graphical output presents those numbers intelligibly (i.e., cluster 1.1, 1.2, 1.3, 2.1, 2.2, 4.1, etc.), but the gbk file just uses the per-contig cluster numbers without modification, which is why there can be as many Cluster_1's as there are contigs. Perhaps this is the desired behavior for antiSMASH, but it seems weird.
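For anyone who wants to check the per-contig numbering in their own .gbk file, here is a rough sketch in plain Python. It rests on an assumption that is not confirmed anywhere in this thread: that the antiSMASH v6 GenBank output marks each cluster with a "region" feature carrying a /region_number qualifier that restarts at 1 for every record. The function name iter_region_labels is made up for illustration. It builds "contig.region" labels in the style of the online output by counting LOCUS records.

```python
import re
from typing import Iterable, Iterator, Tuple

# ASSUMPTION: clusters appear as "region" features with a /region_number
# qualifier that restarts at 1 in every GenBank record (contig).
REGION_NUMBER_RE = re.compile(r'^\s+/region_number="?(\d+)"?')


def iter_region_labels(gbk_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_name, 'contigIndex.regionNumber') for each region feature."""
    contig_index = 0
    contig_name = ""
    in_features = False
    in_region = False
    for line in gbk_lines:
        if line.startswith("LOCUS"):
            # A new record: bump the 1-based contig counter.
            contig_index += 1
            contig_name = line.split()[1]
            in_features = in_region = False
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN"):
            in_features = in_region = False
        elif in_features and line[:5] == "     " and line[5:6].strip():
            # Feature key lines are indented by exactly five spaces.
            in_region = line.split()[0] == "region"
        elif in_region:
            match = REGION_NUMBER_RE.match(line)
            if match:
                yield contig_name, f"{contig_index}.{match.group(1)}"


if __name__ == "__main__":
    demo = [
        "LOCUS       contig_1   100 bp   DNA   linear   UNK",
        "FEATURES             Location/Qualifiers",
        "     region          1..50",
        '                     /region_number="1"',
        "//",
        "LOCUS       contig_2   100 bp   DNA   linear   UNK",
        "FEATURES             Location/Qualifiers",
        "     region          1..50",
        '                     /region_number="1"',
        "//",
    ]
    for name, label in iter_region_labels(demo):
        print(name, label)
```

If two contigs both report region number 1, the labels still differ, which is the distinction that gets lost when only the bare number is used.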

My current workaround is to (1) run a script to renumber the clusters in annotate_misc/annotations.antismash.clusters.txt and (2) rerun funannotate annotate with an additional argument I have added that uses the edited annotations.antismash.clusters.txt when merging all of the annotations. Note that this cluster numbering may not match the number of clusters in annotate_misc/antismash/clusters.bed because some of the clusters delimited in that file can have identical ranges... I'm not certain that that is a bug, but it isn't something that shows up when you view the same antismash run in the online graphical output.
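Step (1) could look something like the sketch below. This is not the actual script mentioned above, only a heuristic under stated assumptions: the file is tab-delimited with three columns (transcript ID, "note", "antiSMASH:Cluster_N"), transcript IDs look like CNY61_000480-T1, and a jump of more than max_gap in the locus number (or a change of label) means a new cluster has started. The function name renumber_clusters and the max_gap parameter are hypothetical, and the threshold may need tuning per genome.

```python
import re
from typing import Iterable, List

# ASSUMPTION: transcript IDs follow the pattern <prefix>_<number>-T<n>.
LOCUS_RE = re.compile(r"^(?P<prefix>.+)_(?P<num>\d+)-T\d+$")
LABEL_PREFIX = "antiSMASH:Cluster_"


def renumber_clusters(lines: Iterable[str], max_gap: int = 5) -> List[str]:
    """Give every contiguous run of genes its own globally unique cluster number."""
    out: List[str] = []
    cluster_id = 0
    prev_key = None   # (locus prefix, original label) of the previous gene
    prev_num = None   # locus number of the previous gene
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        match = LOCUS_RE.match(parts[0])
        if len(parts) != 3 or match is None or not parts[2].startswith(LABEL_PREFIX):
            out.append(line)  # pass through anything unexpected unchanged
            continue
        num = int(match.group("num"))
        key = (match.group("prefix"), parts[2])
        # Heuristic: a label change or a large jump in locus number
        # is treated as the start of a new cluster.
        same_run = (
            key == prev_key
            and prev_num is not None
            and 0 < num - prev_num <= max_gap
        )
        if not same_run:
            cluster_id += 1
        out.append(f"{parts[0]}\t{parts[1]}\t{LABEL_PREFIX}{cluster_id}")
        prev_key, prev_num = key, num
    return out


if __name__ == "__main__":
    demo = [
        "CNY61_000484-T1\tnote\tantiSMASH:Cluster_1",
        "CNY61_002216-T1\tnote\tantiSMASH:Cluster_1",
        "CNY61_000691-T1\tnote\tantiSMASH:Cluster_2",
    ]
    print("\n".join(renumber_clusters(demo)))
```

Small gaps inside a cluster (such as CNY61_002220 to CNY61_002222 in the original post) stay within one cluster as long as they do not exceed max_gap.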

IanDMedeiros commented 1 year ago

Hi @ernesfranco: I was able to reproduce your error with my own data and I think it is another issue with how antiSMASH v6 output is getting parsed... in this case it seems to be because of more flexibility in the product names assigned to NRPS and PKS genes? I have a solution that is working and will submit a pull request shortly.