nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

phobius: list index out of range #259

Closed estolle closed 5 years ago

estolle commented 5 years ago

Hi there

I am having an issue with an "out of range" error, either in the phobius output or while merging the phobius/signalp outputs. The run proceeds, but because of the error neither the phobius nor the signalp annotations are passed on. Do you by chance have an idea how to circumvent this error?

These are the two errors (commands below):

File "/opt/funnotate/funannotate-1.5.1/bin/funannotate-functional.py", line 826, in <module>
    lib.parsePhobiusSignalP(phobius_out, signalp_out, membrane_out, secreted_out)
  File "/opt/funnotate/funannotate-1.5.1/lib/library.py", line 3881, in parsePhobiusSignalP
    if int(cols[1]) > 0: #then found TM domain

if I let funannotate run phobius the error looks like this:

File "/opt/funnotate/funannotate-1.5.1/util/phobius-multiproc.py", line 84, in <module>
    result = line[1].split(' ')

My commands:

funannotate annotate --input funannotate.test5 --cpus 100 \
   --busco_db hymenoptera \
   --iprscan funannotate.test5/interproscan.output.xml \
   --phobius funannotate.test5/update_results/phobius.short.txt \
   --force --species "Euglossa viridissima"
[04:41 PM]: OS: linux2, 112 cores, ~ 528 GB RAM. Python: 2.7.12
[04:41 PM]: Running funannotate v1.5.0
[04:41 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[04:41 PM]: Output directory funannotate.test5 already exists, will use any existing data.  If this is not what you want, exit, and provide a unique name for output folder
[04:41 PM]: Parsing input files
[04:41 PM]: Existing tbl found: funannotate.test5/update_results/Euglossa_viridissima.tbl
[04:42 PM]: Adding Functional Annotation to Euglossa viridissima, NCBI accession: None
[04:42 PM]: Annotation consists of: 32,260 gene models
[04:42 PM]: 32,237 protein records loaded
[04:42 PM]: Existing Pfam-A results found: funannotate.test5/annotate_misc/annotations.pfam.txt
[04:42 PM]: 2,636 annotations added
[04:42 PM]: Running Diamond blastp search of UniProt DB version 2018_11
[04:42 PM]: 946 valid gene/product annotations from 1,659 total
[04:42 PM]: Existing Eggnog-mapper results found: funannotate.test5/annotate_misc/eggnog.emapper.annotations
[04:42 PM]: Parsing EggNog Annotations
[04:42 PM]: 22,571 COG and EggNog annotations added
[04:42 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.29
[04:42 PM]: 2,097 gene name and product description annotations added
[04:42 PM]: Existing MEROPS results found: funannotate.test5/annotate_misc/annotations.merops.txt
[04:42 PM]: 445 annotations added
[04:42 PM]: Existing CAZYme results found: funannotate.test5/annotate_misc/annotations.dbCAN.txt
[04:42 PM]: 325 annotations added
[04:42 PM]: Existing BUSCO2 results found: funannotate.test5/annotate_misc/annotations.busco.txt
[04:42 PM]: 3,764 annotations added
[04:42 PM]: Existing Phobius results found: funannotate.test5/annotate_misc/phobius.results.txt
[04:42 PM]: Predicting secreted proteins with SignalP
Traceback (most recent call last):
  File "/opt/funnotate/funannotate-1.5.1/bin/funannotate-functional.py", line 826, in <module>
    lib.parsePhobiusSignalP(phobius_out, signalp_out, membrane_out, secreted_out)
  File "/opt/funnotate/funannotate-1.5.1/lib/library.py", line 3881, in parsePhobiusSignalP
    if int(cols[1]) > 0: #then found TM domain
IndexError: list index out of range

If I re-run and let funannotate run phobius, then it's the same error. First I moved the existing results out of the way:

mv funannotate.test5/annotate_misc/phobius.results.txt funannotate.test5/annotate_misc/phobius.results.txt.old

rerun without --phobius funannotate.test5/update_results/phobius.short.txt

funannotate annotate --input funannotate.test5 --cpus 100 --busco_db hymenoptera --iprscan funannotate.test5/interproscan.output.xml --force --species "Euglossa viridissima"

[05:13 PM]: Predicting secreted and transmembrane proteins using Phobius
Traceback (most recent call last):
  File "/opt/funnotate/funannotate-1.5.1/util/phobius-multiproc.py", line 84, in <module>
    result = line[1].split(' ')
IndexError: list index out of range
[05:13 PM]: Existing SignalP results found: funannotate.test5/annotate_misc/signalp.results.txt
[05:13 PM]: 0 secretome and 0 transmembane annotations added

.... it proceeds from here

estolle commented 5 years ago

I further tried this: removed the phobius output file and executable and re-ran the last command:

Now the signalp information is picked up, although only as "secretome" (the signalp file has 2365 "SignalP-TM" entries, but it seems only 1,306 secretome annotations are used):

[06:02 PM]: Existing SignalP results found: funannotate.test5/annotate_misc/signalp.results.txt
[06:02 PM]: 1,306 secretome and 0 transmembane annotations added

nextgenusfs commented 5 years ago

What does the phobius output file look like? It's probably failing and the script is choking on the output. Since funannotate will use any existing datasets, you need to manually remove those that have failed -- looks like you have that figured out with signalP, so the same is true for phobius. The reason neither of these tools is "required" is that they require separate licenses and phobius won't run on Mac OSX. You can run phobius using the funannotate remote script that uses the EBI servers. Alternatively, as you've figured out, you can configure/install it locally and funannotate will try to run it if it is found in the PATH.

The local phobius method is not doing anything special, it is simply running in -short mode and saving the results in the annotate_misc folder: https://github.com/nextgenusfs/funannotate/blob/f3e9f3b75fb4bb8450f24c7d14d2c4d985055d95/util/phobius-multiproc.py#L34-L39
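For anyone wanting to check whether an existing results file is a leftover from a failed run before deciding to delete it, here is a minimal sketch. It is not part of funannotate: the function name `phobius_results_usable` is hypothetical, and it assumes that a valid `-short` result has at least one row of four whitespace-separated fields (ID, TM count, SP flag, prediction) with an integer in the TM column.

```python
import os


def phobius_results_usable(path):
    """Return True if `path` looks like non-empty Phobius short output.

    Hypothetical helper, not funannotate code. A file counts as usable if
    at least one line has four or more whitespace-separated fields and an
    integer in the second (TM) column.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False  # missing or empty: most likely a failed run
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 4 and fields[1].isdigit():
                return True
    return False
```

If this returns False for `annotate_misc/phobius.results.txt`, moving or deleting that file should let funannotate try the Phobius step again instead of reusing the stale output.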

estolle commented 5 years ago

Thanks for your quick reply. I guess the output was empty because of the error during the phobius run. If I run it locally with the -short option, then the output looks like this (below). I don't really have experience with what this should look like. It seems I can certainly go ahead with my annotation without phobius.

FUN_032294-T1 0 0 o
FUN_032295-T1 0 0 o
FUN_032296-T1 0 0 o
FUN_032297-T1 0 0 o
FUN_032298-T1 1 0 i44-63o
FUN_032299-T1 0 0 o
FUN_032300-T1 0 0 i
FUN_032301-T1 0 0 o
FUN_032302-T1 1 0 o6-27i
FUN_032303-T1 0 0 o
FUN_032304-T1 4 0 i426-444o450-468i475-496o508-530i
FUN_032305-T1 1 0 i159-183o
FUN_032306-T1 0 0 i
FUN_032307-T1 1 0 o86-105i
FUN_032308-T1 0 0 o
FUN_032309-T1 0 Y n5-13c18/19o
FUN_032310-T1 0 0 o
FUN_032311-T1 1 0 i157-182o
FUN_032312-T1 0 0 o
FUN_032313-T1 0 0 o
FUN_032314-T1 0 Y n8-19c24/25o
FUN_032315-T1 0 0 o
FUN_032316-T1 0 0 o
FUN_032317-T1 0 0 o

alishaquandt commented 5 years ago

Hi Jon, I'm getting this same Phobius error using funannotate annotate. It runs fine through several annotation steps, and then dies here:

[11:54 AM]: 1,209 annotations added
[11:54 AM]: Existing Phobius results found: /scratch/summit/caqu8258/Crypto/funannotate_predict_v2/annotate_misc/phobius.results.txt
[11:54 AM]: Predicting secreted proteins with SignalP

Traceback (most recent call last):
  File "/projects/caqu8258/software/build/funannotate/bin/funannotate-functional.py", line 826, in <module>
    lib.parsePhobiusSignalP(phobius_out, signalp_out, membrane_out, secreted_out)
  File "/projects/caqu8258/software/build/funannotate/lib/library.py", line 3881, in parsePhobiusSignalP
    if int(cols[1]) > 0: #then found TM domain
IndexError: list index out of range

Here's my code:

funannotate annotate -i ./funannotate_predict_v2/ --sbt ./Nagfri.sbt --antismash ./Naganishia_friedmannii.gbk --phobius ./short_nagfri_phobius_output.txt --iprscan ./Nag_Fri_ipscan_results.xml --busco_db /local/bin/busco/Lineages/basidiomycota_odb9 -t "-l paried ends"

Here's the head of "short" phobius output:

SEQENCE ID TM SP PREDICTION
NAGFRI_000002-T1 1 0 o97-122i
NAGFRI_000003-T1 0 0 o
NAGFRI_000004-T1 0 0 o
NAGFRI_000005-T1 1 0 o215-236i
NAGFRI_000006-T1 0 0 o
NAGFRI_000007-T1 0 0 o
NAGFRI_000008-T1 0 0 o
NAGFRI_000009-T1 0 0 o
NAGFRI_000010-T1 0 Y n2-9c14/15o

Signalp is installed and in my path - I've run data with it. Any thoughts?

nextgenusfs commented 5 years ago

Do you have phobius installed on this machine? Does it work if you let funannotate run it? There is something in your format it isn’t expecting I think, will try to find time this weekend to have a look.

alishaquandt commented 5 years ago

I figured it out! Two things were necessary to make it work:

1) Header line in phobius output must be deleted.

2) Phobius output must be converted to tab delimited (as that's how library.py divides the columns). For whatever reason, my output from phobius was not tab delimited. I couldn't find any information in their documentation about how their output should be formatted.
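Both manual steps can be scripted. The following is only a sketch of that workaround, not code from funannotate: `normalize_phobius_short` is a hypothetical name, and it assumes every data row has exactly four whitespace-separated fields with an integer TM count in the second column, as in the output posted above.

```python
def normalize_phobius_short(src_path, dst_path):
    """Rewrite Phobius short output as header-free, tab-delimited text.

    Hypothetical helper. Returns the number of data rows written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.split()  # splits on any run of spaces or tabs
            # Keep only data rows: ID, TM count, SP flag, prediction.
            # The header has five words and a non-numeric second field,
            # so it is dropped here along with blank lines.
            if len(fields) != 4 or not fields[1].isdigit():
                continue
            dst.write("\t".join(fields) + "\n")
            written += 1
    return written
```

The cleaned file can then be passed to `--phobius` in place of the raw output.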

Thanks for your help, Alisha

nextgenusfs commented 5 years ago

Okay, good to know. I will make the parser more flexible.

nextgenusfs commented 5 years ago

Could you try the latest commit with the original phobius input that you had, i.e. the one that failed? I think I fixed the parser, but I don't have an example dataset. So run a git pull and then the latest version should be:

$ funannotate version
funannotate v1.6.0-046e957

alishaquandt commented 5 years ago

Hi Jon, I updated and got this:

funannotate version
funannotate v1.6.0-ac857a9

This seemed to fix the parsing issue, but did not fix the header issue. This is the error I get now:

[03:31 PM]: Existing Phobius results found: /scratch/Crypto/funannotate_predict_v2/annotate_misc/phobius.results.txt
[03:31 PM]: Predicting secreted proteins with SignalP

Traceback (most recent call last):
  File "/projects/software/build/funannotate/bin/funannotate-functional.py", line 826, in <module>
    lib.parsePhobiusSignalP(phobius_out, signalp_out, membrane_out, secreted_out)
  File "/projects/software/build/funannotate/lib/library.py", line 4385, in parsePhobiusSignalP
    if int(cols[1]) > 0: #then found TM domain
ValueError: invalid literal for int() with base 10: 'ID'

The "ID" it's choking on is the ID in the Phobius output header line. I hope this helps!
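To illustrate the two failure modes seen in this thread (the IndexError from lines with too few columns and the ValueError from the header), here is a rough sketch of a more tolerant reader. This is not the actual `parsePhobiusSignalP` from library.py; `parse_phobius_short` is a hypothetical name, and the column layout (ID, TM, SP, PREDICTION) is assumed from the output posted earlier.

```python
def parse_phobius_short(lines):
    """Yield (protein_id, tm_count, has_signal_peptide, topology) tuples.

    Hypothetical sketch, not funannotate's parser.
    """
    for line in lines:
        cols = line.split()  # any whitespace, so spaces and tabs both work
        if len(cols) < 4:
            continue  # blank or truncated line: avoids the IndexError
        try:
            tm_count = int(cols[1])
        except ValueError:
            continue  # header row: avoids the ValueError on 'ID'
        yield cols[0], tm_count, cols[2] == "Y", cols[3]
```

Used on an open file handle, e.g. `list(parse_phobius_short(open(path)))`, this skips the header and returns one tuple per protein regardless of whether the columns are separated by spaces or tabs.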

nextgenusfs commented 5 years ago

What does your header look like??

Is it really misspelled???

This is what you posted:

Here's the head of "short" phobius output:
SEQENCE ID TM SP PREDICTION
NAGFRI_000002-T1 1 0 o97-122i
NAGFRI_000003-T1 0 0 o
NAGFRI_000004-T1 0 0 o
NAGFRI_000005-T1 1 0 o215-236i

I assumed that it was actually SEQUENCE ID..... and where did you get this result from btw??

nextgenusfs commented 5 years ago

Try git pull and run again, https://github.com/nextgenusfs/funannotate/commit/34e655f22cb04a51043e839eda1e96218ecfeed2

alishaquandt commented 5 years ago

Okay, that fixed it!