nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

MIBiG results antiSMASH #156

Closed AnotherSimon closed 6 years ago

AnotherSimon commented 6 years ago

Hi John,

In v1.3.0 I'm having an issue using funannotate annotate --antismash to incorporate results from funannotate remote. It appears that the antiSMASH results are not parsed properly or that a file path is not set correctly. Example output:

...
[Mar 23 12:11 AM]: Found phobius pre-computed results
[Mar 23 12:11 AM]: Predicting secreted proteins with SignalP
[Mar 23 12:14 AM]: 171 secretome and 786 transmembane annotations added
[Mar 23 12:14 AM]: Parsing InterProScan5 XML file
[Mar 23 12:14 AM]: Now parsing antiSMASH results, finding SM clusters
[Mar 23 12:14 AM]: Found 5 clusters, 165 biosynthetic enyzmes, and 17 smCOGs predicted by antiSMASH
[Mar 23 12:14 AM]: Found 0 duplicated annotations, adding 33,894 valid annotations
[Mar 23 12:14 AM]: Converting to final Genbank format, good luck!
[Mar 23 12:14 AM]: Creating AGP file and corresponding contigs file
[Mar 23 12:14 AM]: Cross referencing SM cluster hits with MIBiG database version 1.3
Traceback (most recent call last):
  File "/home/simon/software/funannotate/bin/funannotate-functional.py", line 1055, in
    with open(mibig_blast, 'rU') as input:
IOError: [Errno 2] No such file or directory: '../MyBug/annotate_misc/antismash/smcluster.MIBiG.blast.txt'

I'm aware of issue #121, but this doesn't seem to be the same bug.

nextgenusfs commented 6 years ago

I thought I might have addressed that with commit https://github.com/nextgenusfs/funannotate/commit/e65305a42ec4a3f4f76b13825d86af7fbd53eab8. I'm not sure whether the current tip is stable; I'm away from the office doing field work, so it is difficult for me to check full functionality on my laptop. But if you are using a version newer than that commit, then it is obviously not fixed.

Do the annotations from antiSMASH have the correct mRNA IDs? They should end with -T1, etc. (you can look at the text files in annotate_misc that correspond to the antiSMASH results). This, again, was a bug introduced when trying to support multiple transcripts.

AnotherSimon commented 6 years ago

My version is definitely post e65305a. Sample from MyBug/annotate_misc/annotations.antismash.txt:

...
FUN_001646-T1 product Nonribosomal Peptide Synthase (NRPS)
FUN_002750-T1 product terpene cyclase
FUN_001649-T1 note SMCOG1227:ribosome_biogenesis_GTP-binding_protein_YsxC
FUN_001162-T1 note SMCOG1248:methyltransferase
FUN_001642-T1 note SMCOG1173:WD-40_repeat-containing_protein
...

PS: None of the entries end in "-T2".

gamcil commented 6 years ago

I was having this issue with local fungiSMASH results at commit https://github.com/nextgenusfs/funannotate/commit/e1048c06ab55e74fcc682877197cd5d059686683, so I don't think there's a difference in output between antiSMASH versions.

Found the problem in funannotate-functional.py: gene names from the predict_results proteins.fa are stripped of the -T1 suffix, but those in the set of SM cluster proteins are not, causing the membership check to fail. As a result, writing smcluster.proteins.fasta fails, the diamond search fails, and smcluster.MIBiG.blast.txt is never created.
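The mismatch described above can be reproduced in a few lines. This is a minimal sketch, not funannotate's actual code: the names `cluster_ids`, `predicted_ids`, and `base_id` are hypothetical, and it assumes transcript IDs of the form `FUN_001646-T1` as shown earlier in the thread.

```python
import re

# Hypothetical data: IDs collected from the antiSMASH clusters keep the
# transcript suffix, while IDs read from proteins.fa have had it stripped.
cluster_ids = {"FUN_001646-T1", "FUN_002750-T1"}
predicted_ids = ["FUN_001646", "FUN_002750", "FUN_009999"]

# Comparing the two forms directly never matches, so nothing would be written.
naive_hits = [pid for pid in predicted_ids if pid in cluster_ids]


def base_id(identifier):
    """Drop a trailing transcript suffix such as -T1 or -T12, if present."""
    return re.sub(r"-T\d+$", "", identifier)


# Normalizing both sides to the same form restores the matches.
normalized_clusters = {base_id(cid) for cid in cluster_ids}
fixed_hits = [pid for pid in predicted_ids if base_id(pid) in normalized_clusters]

print(naive_hits)
print(fixed_hits)
```

Anchoring the pattern at the end of the string avoids truncating an ID that happens to contain `-T` elsewhere, which a plain `split('-T')[0]` would do.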

I'll try to create a pull request to fix it (https://github.com/nextgenusfs/funannotate/pull/169).

iwangtoknow commented 6 years ago

This issue also appears in version 1.3.3

iwangtoknow commented 6 years ago

Dear Jon, I fixed the problem. Could you please take a look and check whether I am right? I modified two Python scripts:

  1. funannotate-1.3.3/bin/funannotate-functional.py
  2. funannotate-1.3.3/lib/library.py

In funannotate-functional.py, line 1062:

    #genename = record.id.split('-T')[0]
    genename = record.id

In library.py, lines 2430 and 2431:

    #protout.write('>%s %s\n%s\n' % (geneInfo['ids'][i], g, geneInfo['protein'][i]))
    protout.write('>%s\n%s\n' % (geneInfo['ids'][i], geneInfo['protein'][i]))
    #transout.write('>%s %s\n%s\n' % (geneInfo['ids'][i], g, geneInfo['transcript'][i]))
    transout.write('>%s\n%s\n' % (geneInfo['ids'][i], geneInfo['transcript'][i]))

I think maybe only this change is needed:

    #genename = record.id.split('-T')[0]
    genename = record.id.split('\s')[0]

nextgenusfs commented 6 years ago

What was the error that you were getting with v1.3.3? In Biopython, the record.id is automatically truncated at the first space (the full name can be recovered with record.description), but at any rate, if I can see the error then I can get it fixed.
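To illustrate the point about IDs and spaces, here is a pure-Python sketch that mimics the behavior described above without importing Biopython. The helper `split_header` is hypothetical. It also shows why `split('\s')` from the earlier suggestion would not split on whitespace: `str.split` treats its argument as a literal separator, not a regular expression.

```python
def split_header(header):
    """Return (id, description) for a FASTA header line without the '>'.

    Mimics the behavior described in the thread: the id is the text up to
    the first whitespace and the description is the full header.
    """
    parts = header.split(None, 1)  # split on any run of whitespace, once
    identifier = parts[0] if parts else ""
    return identifier, header


header = "FUN_009004-T1 FUN_009004"
identifier, description = split_header(header)

# The separator below is the two characters backslash and 's'. They do not
# occur in the header, so the header comes back unsplit.
literal_split = header.split("\\s")[0]

print(identifier)
print(description)
print(literal_split)
```

Since the ID is already cut at the first space, an extra split on whitespace would be redundant even if it were written correctly.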

iwangtoknow commented 6 years ago

I hit the same error as in this issue and #121 when running funannotate annotate: IOError: [Errno 2] No such file or directory: ...smcluster.MIBiG.blast.txt, and annotate_misc/antismash/smcluster.proteins.fasta is empty.
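The failure chain reported here (an empty FASTA, no search output, then an IOError on open) can be checked before the results are read. The following is a hedged sketch rather than funannotate's code: `diagnose_mibig_inputs` is a hypothetical helper, and only the two file names come from the thread.

```python
import os


def diagnose_mibig_inputs(fasta_path, blast_path):
    """Return a short message describing why the search output may be missing."""
    if not os.path.isfile(fasta_path):
        return "query FASTA is missing: %s" % fasta_path
    if os.path.getsize(fasta_path) == 0:
        return "query FASTA is empty, so the search had nothing to align: %s" % fasta_path
    if not os.path.isfile(blast_path):
        return "search output was not created, check the logged diamond command: %s" % blast_path
    return "ok"


if __name__ == "__main__":
    base = os.path.join("annotate_misc", "antismash")
    print(diagnose_mibig_inputs(
        os.path.join(base, "smcluster.proteins.fasta"),
        os.path.join(base, "smcluster.MIBiG.blast.txt"),
    ))
```

Reporting which of the two files is the problem separates an ID-matching bug (empty FASTA) from a failing search command (FASTA present, output absent).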

iwangtoknow commented 6 years ago

linuxbrew@f2ef5e724f87:~/data$ funannotate annotate -i fun_20180530 -o fun_20180530 --iprscan fun_20180530/annotate_misc/iprscan.xml


[08:15 AM]: OS: linux2, 12 cores, ~ 37 GB RAM. Python: 2.7.14
[08:15 AM]: Running funannotate v1.3.3
[08:15 AM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[08:15 AM]: Output directory fun_20180530 already exists, will use any existing data. If this is not what you want, exit, and provide a unique name for output folder
[08:15 AM]: Parsing input files
[08:15 AM]: Adding Functional Annotation to Aspergillus nidulans, NCBI accession: None
[08:15 AM]: Annotation consists of: 11,411 gene models
[08:15 AM]: 11,877 protein records loaded
[08:15 AM]: Existing Pfam-A results found: fun_20180530/annotate_misc/annotations.pfam.txt
[08:15 AM]: 2,888 annotations added
[08:15 AM]: Running Diamond blastp search of UniProt DB version 2018_05
[08:16 AM]: 935 valid gene/product annotations from 1,673 total
[08:16 AM]: Existing Eggnog-mapper results found: fun_20180530/annotate_misc/eggnog.emapper.annotations
[08:16 AM]: Parsing EggNog Annotations
[08:16 AM]: 19,051 COG and EggNog annotations added
[08:16 AM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.7
[08:16 AM]: 2,057 gene name and product description annotations added
[08:16 AM]: Existing MEROPS results found: fun_20180530/annotate_misc/annotations.merops.txt
[08:16 AM]: 329 annotations added
[08:16 AM]: Existing CAZYme results found: fun_20180530/annotate_misc/annotations.dbCAN.txt
[08:16 AM]: 910 annotations added
[08:16 AM]: Existing BUSCO2 results found: fun_20180530/annotate_misc/annotations.busco.txt
[08:16 AM]: 1,321 annotations added
[08:16 AM]: Existing Phobius results found: fun_20180530/annotate_misc/phobius.results.txt
[08:16 AM]: Existing SignalP results found: fun_20180530/annotate_misc/signalp.results.txt
[08:16 AM]: 883 secretome and 2,850 transmembane annotations added
[08:16 AM]: Now parsing antiSMASH results, finding SM clusters
[08:16 AM]: Found 51 clusters, 631 biosynthetic enyzmes, and 368 smCOGs predicted by antiSMASH
[08:16 AM]: Found 0 duplicated annotations, adding 80,223 valid annotations
[08:16 AM]: Converting to final Genbank format, good luck!
[08:18 AM]: Creating AGP file and corresponding contigs file
[08:18 AM]: Cross referencing SM cluster hits with MIBiG database version 1.3
Traceback (most recent call last):
  File "/home/linuxbrew/funannotate/bin/funannotate-functional.py", line 1070, in
    with open(mibig_blast, 'rU') as input:
IOError: [Errno 2] No such file or directory: 'fun_20180530/annotate_misc/antismash/smcluster.MIBiG.blast.txt'

nextgenusfs commented 6 years ago

Ok, I think this is finally working again: https://github.com/nextgenusfs/funannotate/commit/f40bed944ee3ceefd51c1ec2666532c1f0927403. The search now works correctly, and parsing the results then requires dropping the transcript number, because the output of the sec met gene clusters uses the locus_tag (geneID), which won't have -T associated with it.

iwangtoknow commented 6 years ago

Thanks, Jon.

reslp commented 5 years ago

Hi, I think I need to reopen this. I am not 100% sure if it is the exact same issue but it sure looks like it from the funannotate output:

When I run funannotate annotate -i xyl_par_draftV1_preds --cpus 24 I get this:


[09:20 AM]: OS: linux2, 80 cores, ~ 791 GB RAM. Python: 2.7.15
[09:20 AM]: Running funannotate v1.5.2
[09:20 AM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[09:20 AM]: Output directory xyl_par_draftV1_preds already exists, will use any existing data. If this is not what you want, exit, and provide a unique name for output folder
[09:20 AM]: Parsing input files
[09:20 AM]: Existing tbl found: xyl_par_draftV1_preds/predict_results/Xylographa_parallela.tbl
[09:20 AM]: Adding Functional Annotation to Xylographa parallela, NCBI accession: None
[09:20 AM]: Annotation consists of: 9,017 gene models
[09:20 AM]: 8,952 protein records loaded
[09:20 AM]: Running HMMer search of PFAM version 32.0
[09:23 AM]: 9,934 annotations added
[09:23 AM]: Running Diamond blastp search of UniProt DB version 2019_02
[09:23 AM]: 0 valid gene/product annotations from 842 total
[09:23 AM]: Running Eggnog-mapper
[09:23 AM]: No Eggnog-mapper results found.
[09:23 AM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.31
[09:23 AM]: 0 gene name and product description annotations added
[09:23 AM]: Running Diamond blastp search of MEROPS version 12.0
[09:23 AM]: 273 annotations added
[09:23 AM]: Annotating CAZYmes using HMMer search of dbCAN version 7.0
[09:24 AM]: 365 annotations added
[09:24 AM]: Annotating proteins with BUSCO dikarya models
[09:24 AM]: 1,243 annotations added
[09:24 AM]: Existing Phobius results found: xyl_par_draftV1_preds/annotate_misc/phobius.results.txt
[09:24 AM]: SignalP not installed, secretome prediction less accurate using only Phobius
[09:24 AM]: 0 secretome and 0 transmembane annotations added
[09:24 AM]: Parsing InterProScan5 XML file
[09:24 AM]: Now parsing antiSMASH results, finding SM clusters
[09:24 AM]: Found 28 clusters, 394 biosynthetic enyzmes, and 128 smCOGs predicted by antiSMASH
[09:24 AM]: Found 0 duplicated annotations, adding 44,131 valid annotations
[09:24 AM]: Converting to final Genbank format, good luck!
[09:25 AM]: Creating AGP file and corresponding contigs file
[09:25 AM]: Cross referencing SM cluster hits with MIBiG database version 1.4
Traceback (most recent call last):
  File "/usr/local/src/funannotate-1.5.2/bin/funannotate-functional.py", line 1101, in
    with open(mibig_blast, 'rU') as input:
IOError: [Errno 2] No such file or directory: 'xyl_par_draftV1_preds/annotate_misc/antismash/smcluster.MIBiG.blast.txt'

The proteins in the smcluster.proteins.fasta have names like this:

FUN_009004-T1 FUN_009004
FUN_009006-T1 FUN_009006

I am unsure what the problem could be. All dependencies are installed correctly, and funannotate check doesn't show anything strange either.

Is there anything I can do to resolve this?

best,

Philipp

nextgenusfs commented 5 years ago

Look in the logfile at the diamond commands, as it looks like those steps aren’t working normally. Also, what version of diamond are you using?

reslp commented 5 years ago

Hi,

It works now. I think the diamond command was indeed the issue. Somehow conda installed (or downgraded to) an old version of diamond (v0.8.22), which didn't recognize the --max-hsps flag. I upgraded to 0.9.24, and after updating the databases, funannotate annotate finishes without errors.
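A version guard along these lines could catch an outdated diamond before the search runs. This is only a sketch: `parse_version`, `is_new_enough`, and `MIN_DIAMOND` are hypothetical, the minimum is set to 0.9.24 simply because that is the version reported to work in this thread (the exact release that introduced --max-hsps is not stated here), and the example strings assume the tool reports something like `diamond version 0.9.24`.

```python
import re

# Assumption: 0.9.24 worked in this thread; the true minimum may be lower.
MIN_DIAMOND = (0, 9, 24)


def parse_version(text):
    """Extract the first dotted numeric version from text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def is_new_enough(text, minimum=MIN_DIAMOND):
    """Compare the parsed version against the minimum, element by element."""
    return parse_version(text) >= minimum


print(is_new_enough("diamond version 0.8.22"))
print(is_new_enough("diamond version 0.9.24"))
```

Comparing tuples of integers rather than raw strings matters here, because a string comparison would rank "0.10.0" below "0.9.24".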

Thank you for your help!

best,

Philipp