nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

annotate_misc/antismash/smcluster.proteins.fasta is zero-size #121

Closed gskrasnov closed 6 years ago

gskrasnov commented 6 years ago

Dear Jon,

Thanks for the latest release of funannotate. I tried it on the Linum usitatissimum genome and almost everything went fine. However, the InterProScan search has been running for 10+ days and appears to be stuck at 90%:

30/12/2017 07:07:37:382 Uploaded/Stored 61793 sequences for analysis
30/12/2017 14:25:01:931 25% completed
30/12/2017 19:17:56:813 50% completed
31/12/2017 04:28:08:500 75% completed
01/01/2018 06:03:25:055 90% completed

I split the *proteins.fa file from update_results into 12 parts of 5000 proteins each and then ran InterProScan on the parts in parallel (4 threads each). That took only a day.
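A minimal sketch of the splitting step described above, assuming a plain multi-record protein FASTA. The function name `split_fasta`, the `chunk_NNN.fa` file naming, and the default chunk size are illustrative choices, not part of funannotate.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, chunk_size=5000):
    """Write consecutive groups of `chunk_size` FASTA records to numbered files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    handle = None
    n_records = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Open a new chunk file every `chunk_size` records.
                if n_records % chunk_size == 0:
                    if handle is not None:
                        handle.close()
                    path = out_dir / f"chunk_{len(chunk_paths) + 1:03d}.fa"
                    chunk_paths.append(path)
                    handle = open(path, "w")
                n_records += 1
            if handle is not None:  # ignore any text before the first header
                handle.write(line)
    if handle is not None:
        handle.close()
    return chunk_paths
```

Each returned path can then be passed to a separate InterProScan run.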

Then I merged the resulting XML files into a single file, dropping the first 2 lines and the last line of each part (the first part keeps its first 2 lines and the last part keeps its last line).
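The merge described above could look roughly like this. It is a line-based sketch, not an XML-aware merge: it assumes every part starts with exactly 2 header lines and ends with exactly 1 closing line, as reported in this comment. The name `merge_xml_parts` and its parameters are hypothetical, and the merged file should still be checked for well-formedness.

```python
def merge_xml_parts(part_paths, merged_path, header_lines=2, footer_lines=1):
    """Concatenate XML parts, keeping one header (first part) and one footer (last part)."""
    last = len(part_paths) - 1
    with open(merged_path, "w") as merged:
        for i, part in enumerate(part_paths):
            with open(part) as handle:
                lines = handle.readlines()
            # Keep the header only for the first part, the footer only for the last.
            start = 0 if i == 0 else header_lines
            end = len(lines) if i == last else len(lines) - footer_lines
            merged.writelines(lines[start:end])
```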

Why not include this procedure in funannotate?

I also want to note that funannotate annotate looks for smcluster.MIBiG.blast.txt, which is produced by searching annotate_misc/antismash/smcluster.proteins.fasta against the MIBiG database with Diamond blastp. However, smcluster.proteins.fasta has zero size, so smcluster.MIBiG.blast.txt is not generated.

I had turned on the additional analyses (Compare to plantiSMASH predicted clusters and Compare to registered known clusters from MIBiG Database) in the antiSMASH search parameters. However, I got the following:

[02:53:25 AM]: Cross referencing SM cluster hits with MIBiG database version 1.3
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-functional.py", line 1102, in <module>
    with open(mibig_blast, 'rU') as input:
IOError: [Errno 2] No such file or directory: '901.fun_out.wo.RepMod/annotate_misc/antismash/smcluster.MIBiG.blast.txt'
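The traceback above comes from opening a BLAST output file that was never written. A defensive pattern, sketched here with hypothetical helper names (this is not funannotate's actual code), is to treat a missing or zero-size file as "no hits" instead of opening it unconditionally.

```python
import os


def has_content(path):
    """Return True only if `path` is an existing file larger than zero bytes."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def read_blast_hits(blast_path):
    """Return tab-split rows of a BLAST tabular file, or [] if it is absent or empty."""
    if not has_content(blast_path):
        return []
    with open(blast_path) as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]
```

The same `has_content` check could be applied to the query FASTA before launching the search at all.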

I was also unable to find a funannotate-annotate or funannotate-functional log file in the logfiles directory.

I shared run data here

nextgenusfs commented 6 years ago

Hi George, thanks for the bug report. Perhaps the plantismash results are different than they are for fungi (that's what I'm betting on..), but I will take a look at your data and see.

The log files don't get into the right place for annotate if the script fails. Basically, this is because of how I parse the input: the script doesn't immediately know where the output directory is, so it waits until the end to move the log file into the proper location. I will look at this as well and see if it can be improved.
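One general way to keep a log file from being stranded when a script fails is to move it in a `finally` block. The sketch below assumes the final log directory is already known when the failure happens, which, as noted above, is not always true early in the annotate script. `run_with_log` and its arguments are hypothetical names.

```python
import os
import shutil


def run_with_log(task, log_path, final_dir):
    """Run `task()` and move the log into `final_dir` whether or not it raises."""
    try:
        return task()
    finally:
        # Runs on success and on failure; the original exception still propagates.
        if os.path.isfile(log_path):
            os.makedirs(final_dir, exist_ok=True)
            shutil.move(log_path, os.path.join(final_dir, os.path.basename(log_path)))
```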

In terms of interproscan, I think it wouldn't be a huge effort to add a wrapper script of sorts that will run IPRSCAN either locally or through Docker. It could split the protein sequences and then run IPR jobs simultaneously. How are you running it, through Docker or Local install?

gskrasnov commented 6 years ago

I run InterProScan via Docker, using the interproscan_docker.sh script suggested by funannotate.

I think the optimal strategy depends on the machine's RAM. One should run as many interproscan_docker.sh processes as possible, each with a small number of threads (ideally 2-4, set with -c). This minimizes runtime, but it may consume a lot of RAM.

nextgenusfs commented 6 years ago

Okay, I converted the shell script to a Python script that should do the splitting and combining. I haven't had a lot of time to test it (it's running now), and it will take a little time to figure out the optimum configuration, i.e. number of sequences per FASTA chunk, number of threads (cpus) per Docker instance, etc. You bring up RAM as well: currently I'm running 20 total cpus, so 5 Docker instances running 4 threads each, and RAM usage seems to be about 18 GB.
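For illustration only, here is a generic pattern for running several chunk jobs at once with a cap on concurrent instances. It is not the linked funannotate-iprscan.py script. The `build_command` callable is a placeholder supplied by the caller and must return the real argument list for one chunk; no InterProScan or Docker flags are assumed here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_chunks(chunk_paths, build_command, max_instances):
    """Run one external command per chunk, at most `max_instances` at a time."""

    def run_one(chunk):
        # Threads are enough here: each worker just waits on a child process.
        result = subprocess.run(build_command(chunk), capture_output=True, text=True)
        return chunk, result.returncode

    with ThreadPoolExecutor(max_workers=max_instances) as pool:
        # pool.map returns results in the same order as chunk_paths.
        return list(pool.map(run_one, chunk_paths))
```

Chunks with a non-zero return code can be collected from the result list and re-run before the outputs are combined.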

The script is here: https://github.com/nextgenusfs/funannotate/blob/master/util/funannotate-iprscan.py

gskrasnov commented 6 years ago

Thank you. I also noticed that one instance of InterProScan (run from Docker) consumes ~5 GB of RAM.
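The CPU/RAM trade-off discussed in this thread can be written as a small calculation. This is a rough sketch: `plan_instances` is a hypothetical helper, and the per-instance memory figure should be measured on your own machine rather than taken from the numbers quoted here.

```python
def plan_instances(total_cpus, threads_per_instance, ram_gb, ram_per_instance_gb):
    """Return how many instances fit within both the CPU and the RAM budget."""
    by_cpu = total_cpus // threads_per_instance
    by_ram = int(ram_gb // ram_per_instance_gb)
    # Always allow at least one instance, even on a very small machine.
    return max(1, min(by_cpu, by_ram))
```

For example, with 20 CPUs, 4 threads per instance, and a hypothetical 64 GB machine at ~5 GB per instance, `plan_instances(20, 4, 64, 5)` is limited by CPUs and returns 5.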

nextgenusfs commented 6 years ago

In terms of the plantismash results, it looks like the IDs weren't parsed correctly: there seems to be a ';' at the end of the ID that for some reason I don't see in the fungismash results. Either way, I think the last two commits should solve it. You can download the current lib/library.py, replace your existing one, and see if that works. Let me know if it does not.
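The kind of normalization described here can be shown in one line. This is only an illustration of stripping a trailing ';' from an ID, not the actual change made in lib/library.py. The function name and the example value are hypothetical.

```python
def clean_locus_id(raw_id):
    """Strip surrounding whitespace and any trailing ';' from a parsed ID."""
    return raw_id.strip().rstrip(";")


# An ID with and without the stray delimiter normalizes to the same key.
assert clean_locus_id("FLN_048020;") == clean_locus_id("FLN_048020")
```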

gskrasnov commented 6 years ago

It seems that everything went well!

2018-01-12 19:54:23,348: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-functional.py -i 901.fun_out.wo.RepMod --cpus 32 --sbt template_flax_genome_2017.11.16.mod.sbt --eggnog ./901.EGGONOG.mapper.dir/901.EGGNOGG.mapper.results.emapper.annotations --antismash ./d878a4a6-c579-49cc-a209-b2ee1974435d/jcf7180002550916.final.gbk --iprscan Linum_usitatissimum.proteins.901.fa.interproscan.xml --busco_db embryophyta --database /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB

2018-01-12 19:54:23,365: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
2018-01-12 19:54:23,532: Running funannotate v1.0.2
2018-01-12 19:54:23,576: Output directory 901.fun_out.wo.RepMod already exists, will use any existing data.  If this is not what you want, exit, and provide a unique name for output folder
2018-01-12 19:54:23,576: Parsing input files
2018-01-12 19:55:17,708: Adding Functional Annotation to Linum usitatissimum, NCBI accession: None
2018-01-12 19:55:17,709: Annotation consists of: 65,024 gene models
2018-01-12 19:55:17,794: 64,087 protein records loaded
2018-01-12 19:55:19,717: Running HMMer search of PFAM version 31.0
2018-01-12 19:55:19,718: 11,048 annotations added
2018-01-12 19:55:19,719: Running Diamond blastp search of UniProt DB version 2017_12
2018-01-12 19:55:38,276: 9,555 valid gene/product annotations from 14,251 total
2018-01-12 19:55:38,314: Parsing EggNog Annotations
2018-01-12 19:55:38,698: 82,524 COG and EggNog annotations added
2018-01-12 19:55:38,698: Combining UniProt/EggNog gene and product names using Gene2Product version 1.3
2018-01-12 19:55:41,566: 11,093 gene name and product description annotations added
2018-01-12 19:55:41,566: Running Diamond blastp search of MEROPS version 12.0
2018-01-12 19:55:41,566: 1,555 annotations added
2018-01-12 19:55:41,566: Annotating CAZYmes using HMMer search of dbCAN version 6.0
2018-01-12 19:55:41,567: 2,460 annotations added
2018-01-12 19:55:41,567: Annotating proteins with BUSCO embryophyta models
2018-01-12 19:55:41,567: 2,331 annotations added
2018-01-12 19:55:41,567: Found phobius pre-computed results
2018-01-12 19:55:41,592: Skipping secretome: neither SignalP nor Phobius installed
2018-01-12 19:55:41,592: 0 secretome and 0 transmembane annotations added
2018-01-12 19:55:42,989: Parsing InterProScan5 XML file
2018-01-12 19:55:42,992: /usr/bin/python /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/util/iprscan2annotations.py 901.fun_out.wo.RepMod/annotate_misc/iprscan.xml 901.fun_out.wo.RepMod/annotate_misc/annotations.iprscan.txt
2018-01-12 19:57:01,390: Now parsing antiSMASH results, finding SM clusters
2018-01-12 19:57:18,992: Found 85 clusters, 394 biosynthetic enyzmes, and 0 smCOGs predicted by antiSMASH
2018-01-12 19:57:18,993: bedtools intersect -wo -a 901.fun_out.wo.RepMod/annotate_misc/antismash/clusters.bed -b /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_results/Linum_usitatissimum.gff3
2018-01-12 19:57:21,848: Found 384 duplicated annotations, adding 303,284 valid annotations
2018-01-12 19:57:25,267: Converting to final Genbank format, good luck!
2018-01-12 19:57:25,457: tbl2asn -y "Annotated using funannotate v1.0.2" -N 1 -p 901.fun_out.wo.RepMod/annotate_misc/tbl2asn -t template_flax_genome_2017.11.16.mod.sbt -M n -Z discrepency.report.txt -j "[organism=Linum usitatissimum]" -V b -c fx -T -a r10u
2018-01-12 20:13:25,451: [tbl2asn] Flatfile genome

[tbl2asn] Validating genome

2018-01-12 20:14:08,571: Creating AGP file and corresponding contigs file
2018-01-12 20:14:08,571: perl /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/util/fasta2agp.pl Linum_usitatissimum.scaffolds.fa
2018-01-12 20:14:16,453: Cross referencing SM cluster hits with MIBiG database version 1.3
2018-01-12 20:14:17,333: diamond blastp --sensitive --query 901.fun_out.wo.RepMod/annotate_misc/antismash/smcluster.proteins.fasta --threads 32 --out 901.fun_out.wo.RepMod/annotate_misc/antismash/smcluster.MIBiG.blast.txt --db /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB/mibig.dmnd --max-hsps 1 --evalue 0.001 --max-target-seqs 1 --outfmt 6
2018-01-12 20:14:21,426: diamond v0.9.14.115 | by Benjamin Buchfink <buchfink@gmail.com>
Licensed under the GNU AGPL <https://www.gnu.org/licenses/agpl.txt>
Check http://github.com/bbuchfink/diamond for updates.

.................
.................

Total time = 4.05364s
Reported 424 pairwise alignments, 424 HSPs.
424 queries aligned.

2018-01-12 20:14:21,429: Creating tab-delimited SM cluster output
2018-01-12 20:14:58,887: Writing genome annotation table.
2018-01-12 20:15:36,932: Funannotate annotate has completed successfully!
2018-01-12 20:15:36,933: To fix gene names/product deflines, manually fix or can remove in 901.fun_out.wo.RepMod/annotate_results/Gene2Products.must-fix.txt

   funannotate annotate -i 901.fun_out.wo.RepMod --fix fixed_file.txt --remove delete.txt

I'm a little confused by the following:

2018-01-12 19:55:41,567: Found phobius pre-computed results
2018-01-12 19:55:41,592: Skipping secretome: neither SignalP nor Phobius installed
2018-01-12 19:55:41,592: 0 secretome and 0 transmembane annotations added

Is this OK?

gskrasnov commented 6 years ago

I also ran funannotate compare to compare the two newly annotated flax genomes and got the following:

funannotate compare --cpus 60 --database /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB --out 901.903.compare -i /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod /mnt/raid/illumina/geo/Flax-2017.Annotation.903/funannotate/903.fun_out.wo.RepMod
-------------------------------------------------------
[08:45:16 PM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[08:45:16 PM]: Running funannotate v1.0.2
[08:45:16 PM]: Now parsing 2 genomes
[08:45:47 PM]: working on Linum usitatissimum
[08:48:52 PM]: working on Linum usitatissimum
[08:51:27 PM]: Summarizing secondary metabolism gene clusters
[08:51:29 PM]: Summarizing PFAM domain results
[08:51:31 PM]: Summarizing InterProScan results
[08:51:33 PM]: Loading InterPro descriptions
[08:51:39 PM]: Summarizing MEROPS protease results
[08:51:40 PM]: found 13/93 MEROPS familes with stdev >= 1.000000
[08:51:40 PM]: Summarizing CAZyme results
[08:51:40 PM]: found 41/148 CAZy familes with stdev >= 1.000000
[08:51:41 PM]: Summarizing COG results
[08:51:42 PM]: No SignalP annotations found
[08:51:42 PM]: Summarizing fungal transcription factors
[08:51:43 PM]: Running GO enrichment for each genome
[08:55:55 PM]: Running orthologous clustering tool, ProteinOrtho5.  This may take awhile...
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-compare.py", line 754, in <module>
    df = pd.read_csv(os.path.join(args.out, 'protortho', 'funannotate.poff'), sep='\t', header=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 764, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 985, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1605, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 394, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)
  File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)
IOError: File 901.903.compare/protortho/funannotate.poff does not exist

Here are the results of funannotate annotate. Thank you in advance!

nextgenusfs commented 6 years ago

First, about Phobius/SignalP: it wasn't able to find them in your PATH, so you just need to install each tool and put it in your PATH.
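A quick way to check which executables are visible on PATH from Python is `shutil.which`. The helper name `missing_tools` is hypothetical, and the executable names you pass in (for example `signalp`) are assumptions that depend on how each tool was installed.

```python
import shutil


def missing_tools(tool_names):
    """Return the subset of `tool_names` that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]
```

Calling `missing_tools(["signalp"])` returns an empty list when the executable is found and `["signalp"]` when it is not.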

It looks like proteinortho5 tool failed, any clues in the log file?

nextgenusfs commented 6 years ago

And I messed up the output "text" for annotate: it should be telling you in the terminal that you have lots of gene names/product deflines that need to be fixed. This commit should fix that: https://github.com/nextgenusfs/funannotate/commit/177cc9983faaa161bea5c734f647ddc9d76691ad. Since you are one of the first to annotate a plant with the gene2products database, there are a lot of names/products to add to the DB and also many that need to be curated. I've got the scripts set up to alert the user which ones need to be fixed, which ones passed but need curation, and which ones passed but are not in the database.

nextgenusfs commented 6 years ago

Sorry, found one more bug in the antiSMASH output: it was repeating the same output for the sec met backbone names because the parser was nested inside the wrong loop in the code. https://github.com/nextgenusfs/funannotate/commit/88c1c18e423c2940c108ec3d1039be9788467f8d. I apparently never saw this before because the fungismash backbone enzymes had more annotation associated with them.

gskrasnov commented 6 years ago

> First about the Phobius/SignalP, it wasn't able to find them in your path, so just need to install each tool and put in your PATH.

Yes, I do not have these tools installed. However, I had previously run funannotate remote for Phobius service and funannotate did find Phobius results:

2018-01-12 19:55:41,567: Found phobius pre-computed results
2018-01-12 19:55:41,592: Skipping secretome: neither SignalP nor Phobius installed
2018-01-12 19:55:41,592: 0 secretome and 0 transmembane annotations added

> It looks like proteinortho5 tool failed, any clues in the log file?

Here is funannotate compare log file:

2018-01-12 20:45:16,152: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-compare.py --cpus 60 --database /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB --out 901.903.compare -i /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod /mnt/raid/illumina/geo/Flax-2017.Annotation.903/funannotate/903.fun_out.wo.RepMod

2018-01-12 20:45:16,170: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
2018-01-12 20:45:16,342: Running funannotate v1.0.2
2018-01-12 20:45:16,597: Input files/folders: ['/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod', '/mnt/raid/illumina/geo/Flax-2017.Annotation.903/funannotate/903.fun_out.wo.RepMod']

2018-01-12 20:45:16,597: Now parsing 2 genomes
2018-01-12 20:45:47,152: working on Linum usitatissimum
2018-01-12 20:48:52,954: working on Linum usitatissimum
2018-01-12 20:51:27,344: Summarizing secondary metabolism gene clusters
2018-01-12 20:51:29,925: Summarizing PFAM domain results
2018-01-12 20:51:31,016: Summarizing InterProScan results
2018-01-12 20:51:33,099: Loading InterPro descriptions
2018-01-12 20:51:39,874: Summarizing MEROPS protease results
2018-01-12 20:51:40,143: found 13/93 MEROPS familes with stdev >= 1.000000
2018-01-12 20:51:40,558: Summarizing CAZyme results
2018-01-12 20:51:40,815: found 41/148 CAZy familes with stdev >= 1.000000
2018-01-12 20:51:41,543: Summarizing COG results
2018-01-12 20:51:42,045: SignalP raw data:
[{}, {}]
2018-01-12 20:51:42,045: No SignalP annotations found
2018-01-12 20:51:42,045: [{}, {}]

2018-01-12 20:51:42,045: Summarizing fungal transcription factors
2018-01-12 20:51:43,391: Running GO enrichment for each genome
2018-01-12 20:51:43,392: find_enrichment.py --obo /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB/go.obo --pval 0.001 --alpha 0.001 --method fdr 901.903.compare/go_terms/Linum_usitatissimum.txt 901.903.compare/go_terms/population.txt 901.903.compare/go_terms/associations.txt
2018-01-12 20:55:55,313: Running orthologous clustering tool, ProteinOrtho5.  This may take awhile...
2018-01-12 20:55:55,314: proteinortho5.pl -project=funannotate -synteny -cpus=60 -singles -selfblast Linum_usitatissimum.faa Linum_usitatissimum.faa
2018-01-12 20:55:55,828: *****************************************************************
Proteinortho with PoFF version 5.15 - An orthology detection tool
*****************************************************************
Using 60 CPU threads, Detected NCBI BLAST version 2.6.0+
Checking input files
Error: Gene ID 'FLN_048020' is defined at least twice:
Linum_usitatissimum.faa
Linum_usitatissimum.faa

I think the error may come from the fact that I am comparing two genomes from the same species (two different cultivars, 901 and 903). Both FASTA files may be named like Linum_usitatissimum.fa. Maybe I should define the species more precisely (e.g. "Linum usitatissimum 901") in the GBK files and provide just those?
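To check whether protein IDs collide across the inputs before running an orthology tool, something like the following could be used. It is a sketch with a hypothetical function name. It treats the first whitespace-delimited token of each FASTA header as the ID, which is an assumption about the header layout.

```python
from collections import defaultdict


def duplicated_ids(fasta_paths):
    """Map each protein ID seen in more than one record to the files containing it."""
    seen = defaultdict(list)
    for path in fasta_paths:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    header = line[1:].split()
                    if header:  # skip headers with no ID token
                        seen[header[0]].append(str(path))
    return {pid: files for pid, files in seen.items() if len(files) > 1}
```

An empty result means every ID is unique across the given files.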

nextgenusfs commented 6 years ago

If you just delete the phobius files from annotate_misc, it will run it locally. Does "which signalp" work on the command line? From the log file, it doesn't look like it tried to run it. You can re-run funannotate annotate and add --isolate or --strain to differentiate the naming scheme.

gskrasnov commented 6 years ago

Thank you. I will try to get the standalone version of SignalP.

When I provided --phobius ./phobius.results.txt (with the update from 88c1c18), it worked well:

2018-01-13 08:20:24,227: Found phobius pre-computed results
2018-01-13 08:20:24,254: SignalP not installed, secretome prediction less accurate using only Phobius
2018-01-13 08:20:25,770: 7,635 secretome and 11,157 transmembane annotations added

But when I did not do this (in the previous runs), funannotate did find the Phobius results (generated with funannotate remote) in annotate_misc/phobius.results.txt but did not add the Phobius annotations:

2018-01-12 19:55:20,092: Found phobius pre-computed results
2018-01-12 19:55:20,117: Skipping secretome: neither SignalP nor Phobius installed
2018-01-12 19:55:20,117: 0 secretome and 0 transmembane annotations added

I tried --isolate 901, and this helped!

fmobegi commented 3 years ago

AntiSMASH returns an empty file even when the results show regions detected to be secondary metabolite biosynthesis genes.


[01:27 PM]: CMD ERROR: diamond blastp --sensitive --query /ppgdata/fredrick/assembly_data/ascochyta/FINAL_Assemblies/genome_annotation/funannotate_output/ME14/annotate_misc/antismash/smcluster.proteins.fasta --threads 12 --out /ppgdata/fredrick/assembly_data/ascochyta/FINAL_Assemblies/genome_annotation/funannotate_output/ME14/annotate_misc/antismash/smcluster.MIBiG.blast.txt --db /home/fredrick/funannotate_db/mibig.dmnd --max-hsps 1 --evalue 0.001 --max-target-seqs 1 --outfmt 6
b'diamond v0.9.26.127 | by Benjamin Buchfink <buchfink@gmail.com>\nLicensed under the GNU GPL <https://www.gnu.org/licenses/gpl.txt>\nCheck http://github.com/bbuchfink/diamond for updates.\n\n#CPU threads: 12\nScoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)\nTemporary directory: /ppgdata/fredrick/assembly_data/ascochyta/FINAL_Assemblies/genome_annotation/funannotate_output/ME14/annotate_misc/antismash\nOpening the database...  [4.4e-05s]\n#Target sequences to report alignments for: 1\nOpening the input file...  [8.6e-05s]\nError: Error detecting input file format. First line seems to be blank.\n'
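The Diamond message above points at the query file itself. A small diagnostic, sketched here with a hypothetical function name, can tell apart an empty file, a blank first line, and a first line that is not a FASTA header before the search is launched.

```python
def fasta_problem(path):
    """Describe what is wrong with a FASTA file, or return None if it looks usable."""
    try:
        with open(path) as handle:
            first_line = handle.readline()
    except OSError as err:
        return f"cannot read file: {err}"
    if first_line == "":
        return "file is empty"
    if first_line.strip() == "":
        return "first line is blank"
    if not first_line.startswith(">"):
        return "first line is not a FASTA header"
    return None
```

Only the first line is inspected, so a `None` result does not guarantee that the rest of the file is valid.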