annotate Diamond blastp Uniprot, index out of range - Githubissues

nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline

http://funannotate.readthedocs.io

BSD 2-Clause "Simplified" License


annotate Diamond blastp Uniprot, index out of range #145

Closed AnotherSimon closed 4 years ago

AnotherSimon commented 6 years ago

When running the command:

funannotate annotate --input My_bug --sbt template.sbt \
    --antismash ./My_bug/annotate_misc/antiSMASH.results.gbk \
    --iprscan ./My_bug/annotate_misc/iprscan.xml \
    --phobius ./My_bug/annotate_misc/phobius.results.txt \
    --cpus 24

The following error occurs:

[11:33:30 AM]: OS: linux2, 24 cores, ~ 66 GB RAM. Python: 2.7.9
[11:33:31 AM]: Running funannotate v1.1.1
[11:33:32 AM]: Output directory My_bug already exists, will use any existing data. If this is not what you want, exit, and provide a unique name for output folder
[11:33:32 AM]: Parsing input files
[11:33:39 AM]: Adding Functional Annotation to My bug, NCBI accession: None
[11:33:39 AM]: Annotation consists of: 8,224 gene models
[11:33:39 AM]: 8,166 protein records loaded
[11:33:40 AM]: Running HMMer search of PFAM version 31.0
[11:37:17 AM]: 2,015 annotations added
[11:37:17 AM]: Running Diamond blastp search of UniProt DB version 2018_01
Traceback (most recent call last):
  File "/data1/home/simon/software/funannotate/bin/funannotate-functional.py", line 617, in <module>
    SwissProtBlast(Proteins, args.cpus, 1e-5, os.path.join(outputdir, 'annotate_misc'), GeneProducts)
  File "/data1/home/simon/software/funannotate/bin/funannotate-functional.py", line 237, in SwissProtBlast
    name = description[2].replace(' PE','').upper()
IndexError: list index out of range

For completeness, I should mention that the remote annotation finished fine in all three cases, but there was an issue writing the log files for antiSMASH and IPRscan:

[12:35:39 PM]: Results GBK: My_bug/annotate_misc/antiSMASH.results.gbk
[12:35:39 PM]: Remote searches complete
Downloading: https://fungismash.secondarymetabolites.org/upload/fungi-ec57c32d-.../scaffold_1.zip Bytes: 29304372
8192 [0.03%]16384 [0.06%]24576 [0.08%]32768 ... [100.00%]
Traceback (most recent call last):
  File "/home/simon/software/funannotate/bin/funannotate-remote.py", line 301, in <module>
    os.rename(log_name, os.path.join(outputdir, 'logfiles', log_name))
OSError: [Errno 2] No such file or directory

Progress: 99.99%
Progress: 99.99%
[02:56:44 AM]: Remote searches complete
Traceback (most recent call last):
  File "/home/simon/software/funannotate/bin/funannotate-remote.py", line 301, in <module>
    os.rename(log_name, os.path.join(outputdir, 'logfiles', log_name))
OSError: [Errno 2] No such file or directory
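The thread does not establish whether the missing path in that os.rename call was the log file itself or the logfiles folder. A minimal sketch of a more defensive move that tolerates either case follows. The helper name move_log is hypothetical and this is not code from funannotate.

```python
import os
import shutil


def move_log(log_name, outputdir):
    """Move log_name into <outputdir>/logfiles (hypothetical helper).

    Returns the destination path, or None if the log file does not exist.
    """
    if not os.path.isfile(log_name):
        # Nothing to move; a bare os.rename would raise OSError here.
        return None
    logdir = os.path.join(outputdir, "logfiles")
    # Create the target folder if it is absent instead of assuming it exists.
    os.makedirs(logdir, exist_ok=True)
    dest = os.path.join(logdir, os.path.basename(log_name))
    shutil.move(log_name, dest)
    return dest
```

With this guard, a missing log file or a missing logfiles folder no longer aborts the run after the remote search has already completed.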

nextgenusfs commented 6 years ago

What version of diamond? https://github.com/nextgenusfs/funannotate/issues/135

nextgenusfs commented 6 years ago

I think that UniProt changed their format for FASTA deflines, which offset the name/description parser. Working on a fix now.
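To illustrate why a positional parser is fragile here, this is a minimal sketch of parsing a UniProt-style description by its KEY=value markers instead of by position. It is not the fix that went into funannotate; the function name parse_defline and the assumption that every field key is two capital letters (OS, OX, GN, PE, SV) are illustrative only.

```python
import re

# Matches " KEY=" markers such as OS=, OX=, GN=, PE=, SV= (assumed two-letter keys).
_FIELD = re.compile(r"\s([A-Z]{2})=")


def parse_defline(defline):
    """Split a UniProt-style description into (product, fields).

    fields maps each KEY to its value, so a newly added key cannot
    shift the position of the ones the caller cares about.
    """
    # A leading space lets the pattern also match a key at the very start.
    parts = _FIELD.split(" " + defline.strip())
    # parts = [product, key1, value1, key2, value2, ...]
    product = parts[0].strip()
    fields = {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}
    return product, fields


product, fields = parse_defline(
    "Cytosolic Fe-S cluster assembly factor CFD1 OS=Chaetomium globosum "
    "OX=306901 GN=CFD1 PE=3 SV=1"
)
gene = fields.get("GN")  # None when the entry has no gene name
```

Looking fields up with .get() means a defline without any KEY=value markers yields an empty dictionary rather than an IndexError.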

nextgenusfs commented 6 years ago

This should fix it https://github.com/nextgenusfs/funannotate/commit/11c374afa1832228cb872d2d4f285a3a41fba9a5

AnotherSimon commented 6 years ago

Think the fix works. However I got 0 UniProt hits, is that suspicious?

[09:59:05 AM]: Running Diamond blastp search of UniProt DB version 2018_02
[09:59:09 AM]: 0 valid gene/product annotations from 882 total
[09:59:11 AM]: Running Eggnog-mapper ...

nextgenusfs commented 6 years ago

Yeah that doesn’t seem right. There should be fewer than the total but not zero.

AnotherSimon commented 6 years ago

I seem to remember UniProt being slightly larger than 882 curated genes. Or is this a particular subset defined by some filtering criteria?

nextgenusfs commented 6 years ago

Yeah, so those are 882 hits that are > 60% identical and cover over 60% of the length of the protein; they are then further filtered for hits that have "proper" gene names and descriptions. Some entries are not curated very well, don't have a gene name, and aren't useful. But more hits should be passing here. Is there any way you could send me the uniprot.xml from your annotate_misc folder?
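The two-stage filter described above can be sketched roughly as follows. This is a minimal illustration, not funannotate's code: the dictionary keys, the choice to measure coverage against the query length, and the has_usable_name rule (non-empty gene name, product not "uncharacterized") are all assumptions.

```python
def passes_thresholds(identical, align_len, query_len,
                      min_identity=60.0, min_coverage=60.0):
    """True if a hit is > min_identity % identical and spans > min_coverage % of the query."""
    if align_len == 0 or query_len == 0:
        return False
    identity = 100.0 * identical / align_len
    coverage = 100.0 * align_len / query_len  # assumption: coverage of the query
    return identity > min_identity and coverage > min_coverage


def has_usable_name(gene, product):
    """Hypothetical stand-in for the "proper gene name and description" check."""
    if not gene or not product:
        return False
    return "uncharacterized" not in product.lower()


def filter_hits(hits):
    """Return (valid_hits, total), mirroring "N valid ... from M total" in the log."""
    total = [h for h in hits
             if passes_thresholds(h["identical"], h["align_len"], h["query_len"])]
    valid = [h for h in total if has_usable_name(h["gene"], h["product"])]
    return valid, len(total)
```

Under this reading, "0 valid from 882 total" means that every hit cleared the identity and coverage thresholds but none survived the name check, which points at the name parsing rather than at the search itself.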

AnotherSimon commented 6 years ago

Our IT is pretty strict about outward-facing sites, so I'll send it to your Gmail address.

nextgenusfs commented 6 years ago

Hmmm, this is the default database installed by funannotate, correct? I guess it must mean that older versions of diamond are doing something different with the deflines, so they aren't being parsed correctly.

Here is what your data looks like for a hit:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>diamond 0.8.22</BlastOutput_version>
  <BlastOutput_reference>Benjamin Buchfink, Xie Chao, and Daniel Huson (2015), &quot;Fast and sensitive protein alignment using DIAMOND&quot;, Nature Methods 12:59-60.</BlastOutput_reference>
  <BlastOutput_db></BlastOutput_db>
...
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|421135</Hit_id>
  <Hit_def>sp|O13882|RT18_SCHPO</Hit_def> 
  <Hit_accession>421135</Hit_accession>
  <Hit_len>223</Hit_len>

This is what the script is expecting.

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>diamond 0.9.14</BlastOutput_version>
  <BlastOutput_reference>Benjamin Buchfink, Xie Chao, and Daniel Huson (2015), &quot;Fast and sensitive protein alignment using DIAMOND&quot;, Nature Methods 12:59-60.</BlastOutput_reference>
  <BlastOutput_db>/usr/local/share/funannotate/uniprot.dmnd</BlastOutput_db>
...
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>sp|Q2GWZ4|CFD1_CHAGB</Hit_id>
  <Hit_def>Cytosolic Fe-S cluster assembly factor CFD1 OS=Chaetomium globosum (strain ATCC 6205 / CBS 148.51 / DSM 1962 / NBRC 6347 / NRRL 1970) OX=306901 GN=CFD1 PE=3 SV=1</Hit_def>
  <Hit_accession>Q2GWZ4</Hit_accession>
  <Hit_len>303</Hit_len>

I think that diamond databases built with v0.8 or older are not compatible with v0.9 and greater. Upgrading diamond to a newer version may fix it (although there are several recent versions where the XML format is broken, see #135), but you will also have to re-run funannotate setup to regenerate the diamond databases. Otherwise I will have to install an older version locally and see whether it yields the same results.

I'm a little bit surprised that the hit IDs say gnl instead of the sp prefix; it must be something hard-coded in older versions of diamond.
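The difference between the two outputs above can be detected before any name parsing is attempted. Here is a minimal sketch using the standard library's xml.etree.ElementTree; the function name hit_format and the labels "expected", "legacy", and "unknown" are hypothetical, and the rules are inferred only from the two examples in this thread.

```python
import xml.etree.ElementTree as ET


def hit_format(hit_xml):
    """Classify a single <Hit> element by where the UniProt ID and description sit."""
    hit = ET.fromstring(hit_xml)
    hit_id = hit.findtext("Hit_id", default="")
    hit_def = hit.findtext("Hit_def", default="")
    if hit_id.startswith("sp|") and " OS=" in hit_def:
        # ID in Hit_id, full description in Hit_def (the diamond 0.9.14 example).
        return "expected"
    if hit_id.startswith("gnl|") and hit_def.startswith("sp|"):
        # Ordinal ID in Hit_id, bare UniProt ID in Hit_def (the diamond 0.8.22 example).
        return "legacy"
    return "unknown"
```

A caller could stop with a clear message on "legacy" output, suggesting a diamond upgrade and a database rebuild, instead of silently reporting zero valid annotations.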

AnotherSimon commented 6 years ago

Updated diamond to 0.9.18, manually deleted all files in $FUNANNOTATE_DB, and ran funannotate setup again (funannotate v1.1.1). I also had to download the eggnog databases again with the -f option because they were incompatible with the newer diamond version.

UniProt seems to pass muster now. However I observed some strange behavior where the results file from remote Phobius seems to disappear when calling funannotate annotate. Trying to reproduce this behavior now.

AnotherSimon commented 6 years ago

My phobius results are definitely getting deleted by funannotate annotate. I'm keeping a renamed copy as a workaround in the meantime. Here's the log:

2018-03-05 10:51:34,208: Running Eggnog-mapper
2018-03-05 10:51:34,208: emapper.py -m diamond -i .../My_bug/annotate_misc/genome.proteins.fasta -o eggnog --cpu 24
2018-03-05 10:51:35,982: # emapper-1.0.3 ./emapper.py -m diamond -i .../My_bug/annotate_misc/genome.proteins.fasta -o eggnog --cpu 24
/home/simon/bin/diamond blastp -d /home/simon/software/eggnog-mapper/data/eggnog_proteins.dmnd -q .../My_bug/annotate_misc/genome.proteins.fasta --more-sensitive --threads 24 -e 0.001000 -o .../My_bug/annotate_misc/emappertmp_dmdn_d2lDJp/36dd9d8b68244edb9a53c02bca1b740e --top 3
2018-03-05 10:51:35,983: Error: Database was built with a different version of Diamond as is incompatible.
Traceback (most recent call last):
  File "/home/simon/software/eggnog-mapper/emapper.py", line 1001, in <module>
    main(args)
  File "/home/simon/software/eggnog-mapper/emapper.py", line 216, in main
    dump_diamond_matches(args.input, seed_orthologs_file, args)
  File "/home/simon/software/eggnog-mapper/emapper.py", line 353, in dump_diamond_matches
    raise e
subprocess.CalledProcessError: Command '/home/simon/bin/diamond blastp -d /home/simon/software/eggnog-mapper/data/eggnog_proteins.dmnd -q .../My_bug/annotate_misc/genome.proteins.fasta --more-sensitive --threads 24 -e 0.001000 -o .../My_bug/annotate_misc/emappertmp_dmdn_d2lDJp/36dd9d8b68244edb9a53c02bca1b740e --top 3' returned non-zero exit status 1

2018-03-05 10:51:35,984: No Eggnog-mapper results found.
2018-03-05 10:51:35,984: Combining UniProt/EggNog gene and product names using Gene2Product version 1.4
2018-03-05 10:51:36,298: 653 gene name and product description annotations added
2018-03-05 10:51:36,298: Running Diamond blastp search of MEROPS version 12.0
2018-03-05 10:51:36,322: 282 annotations added
2018-03-05 10:51:36,323: Annotating CAZYmes using HMMer search of dbCAN version 6.0
2018-03-05 10:51:36,325: 206 annotations added
2018-03-05 10:51:36,325: Annotating proteins with BUSCO dikarya models
2018-03-05 10:51:36,345: 1,841 annotations added

And the StdOut:

[10:51:36 AM]: Running Diamond blastp search of MEROPS version 12.0
[10:51:36 AM]: 282 annotations added
[10:51:36 AM]: Annotating CAZYmes using HMMer search of dbCAN version 6.0
[10:51:36 AM]: 206 annotations added
[10:51:36 AM]: Annotating proteins with BUSCO dikarya models
[10:51:36 AM]: 1,841 annotations added

Traceback (most recent call last):
  File "/home/simon/software/funannotate/bin/funannotate-functional.py", line 767, in <module>
    shutil.copyfile(args.phobius, phobius_out)
  File "/home/ppa/software/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: './My_bug/annotate_misc/phobius.results.txt'

So it's partially my fault for not properly reinstalling eggnog databases but I don't see how that should be related to the phobius results getting deleted.

Small update: to get eggnog working, I had to extract all the FASTA sequences from ~/software/eggnog-mapper/data/eggnog_proteins.dmnd with the diamond distribution bundled with eggnog-mapper and then turn them back into a dmnd file with the shiny new diamond v0.9.18 in my PATH. It might be worth mentioning in the install guide that eggnog-diamond versioning can be an issue.
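The rebuild described above can be scripted roughly as follows. This is a sketch under assumptions: it presumes the bundled diamond binary supports the getseq subcommand and that both binaries accept --db and --in as shown, which should be checked against each version's help output. All function names and example paths are hypothetical.

```python
import subprocess


def rebuild_commands(old_diamond, new_diamond, old_db, fasta, new_db):
    """Build the two command lines: dump sequences with the old binary, re-index with the new."""
    # Assumption: the old binary can write the database's sequences to stdout via getseq.
    extract = [old_diamond, "getseq", "--db", old_db]
    build = [new_diamond, "makedb", "--in", fasta, "--db", new_db]
    return extract, build


def rebuild(old_diamond, new_diamond, old_db, fasta, new_db):
    """Run both steps; raises CalledProcessError if either diamond call fails."""
    extract, build = rebuild_commands(old_diamond, new_diamond, old_db, fasta, new_db)
    with open(fasta, "w") as handle:
        # Redirect the extracted FASTA records into the intermediate file.
        subprocess.run(extract, stdout=handle, check=True)
    subprocess.run(build, check=True)
```

Keeping the command construction separate from execution makes it easy to print and inspect both commands before running them against a large database.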

Small update 2: It appears that this error is not unique to phobius but affects all 3 of the remote search results files. The error seems to stem from storing them in the ./My_bug/annotate_misc folder, where they are overwritten by the funannotate annotate command. So either the user needs to move the results out of this subfolder after funannotate remote, or an update of the script is in order.
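Until the script is updated, the user-side workaround can be automated. A minimal sketch follows, assuming the results are copied to a folder outside annotate_misc and the copies are then passed to --antismash, --iprscan, and --phobius; the helper name stash_results and the folder name in the usage lines are hypothetical.

```python
import os
import shutil


def stash_results(paths, safe_dir):
    """Copy existing result files into safe_dir and return {original: copy}."""
    os.makedirs(safe_dir, exist_ok=True)
    copies = {}
    for path in paths:
        if os.path.isfile(path):
            dest = os.path.join(safe_dir, os.path.basename(path))
            shutil.copy2(path, dest)  # copy2 also preserves timestamps
            copies[path] = dest
    return copies


# Usage: files that are not present are simply skipped.
copies = stash_results(
    [
        "./My_bug/annotate_misc/antiSMASH.results.gbk",
        "./My_bug/annotate_misc/iprscan.xml",
        "./My_bug/annotate_misc/phobius.results.txt",
    ],
    "./My_bug_remote_results",
)
```

Copying rather than moving leaves the originals in place, so nothing is lost if funannotate annotate is later changed to read them from annotate_misc directly.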