nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

annotate step not combining everything after parsing interproscan XML #938

Open benyoung93 opened 11 months ago

benyoung93 commented 11 months ago

Are you using the latest release? Yep :)

funannotate --version
funannotate v1.8.15

Describe the bug I get the following error message (in the Logfiles section below) when running the annotate step. It seems to be linked to the XML file from InterProScan, which I ran locally. The InterProScan version is

interproscan.sh --version
InterProScan version 5.59-91.0
InterProScan 64-Bit build  (requires Java 11)

Here is a head -20 of the XML file from InterProScan. I used the proteins.fa from funannotate update as input for InterProScan.

head -20 Orbicella_faveolata_gen_17.proteins.fa.xml
<?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.59-91.0">
  <protein>
    <sequence md5="c55943849f18482b8ef6668ca556a43f">MLKAYNAAINLPPLQTPRPRRVQSHIPTYIPAYTSPRIIINVSGMRFETYEETLENYPETLLGSPTRRREYYSSSEGEYVFARDKPSFDAILFFYQSRGILAKPDTVSEETFLEEIEFYGLSSNYYSIHCDDLSSTHEDVEEILPLSPHKRKLWLWFEFPRSSQTARVLALWSIFVIIFSTVVFCIETIPQLAIEPRIIYQNISKDGKYVLEEHKEPTVDYWFVMEGIFVAWFTIEYFVRLYSAPVVWNFVKSTMGVLDVLAIFPFYVTLALQQSTDEVRSFAVLRAIRLFRVLRVFKLSRYSDAIKLLVSTLCSSFEQLKTLGFCFAVSVVIFSSAIFYAEGGSNIPSIPDAFWWTVITMTGVGYGDVTPLTPMGKFVGSFCAMSGIVFFCLPTPVLVSNFIKFYLNYGNLNERKKAFAENLKQLFLRPNK</sequence>
    <xref id="Ofa_008232-T1" name="Ofa_008232-T1 Ofa_008232"/>
    <matches>
      <fingerprints-match evalue="3.0E-78" graphscan="IIIIIIII">
        <signature ac="PR00169" desc="Potassium channel signature" name="KCHANNEL">
          <signature-library-release library="PRINTS" version="42.0"/>
        </signature>
        <model-ac>PR00169</model-ac>
        <locations>
          <fingerprints-location motifNumber="5" pvalue="3.04E-11" score="45.74" start="290" end="316">
            <location-fragments>
              <fingerprints-location-fragment start="290" end="316" dc-status="CONTINUOUS"/>
            </location-fragments>
          </fingerprints-location>
          <fingerprints-location motifNumber="2" pvalue="6.99E-15" score="42.84" start="160" end="188">
            <location-fragments>
              <fingerprints-location-fragment start="160" end="188" dc-status="CONTINUOUS"/>
            </location-fragments>

Here is a grep ">" proteins.fa | head -20 of the proteins fasta from the update command.

>Ofa_000002-T1 Ofa_000002
>Ofa_000003-T1 Ofa_000003
>Ofa_000004-T1 Ofa_000004
>Ofa_000005-T1 Ofa_000005
>Ofa_000006-T1 Ofa_000006
>Ofa_000007-T1 Ofa_000007
>Ofa_000008-T1 Ofa_000008
>Ofa_000009-T1 Ofa_000009
>Ofa_000009-T2 Ofa_000009
>Ofa_000009-T3 Ofa_000009
>Ofa_000009-T4 Ofa_000009
>Ofa_000010-T1 Ofa_000010
>Ofa_000011-T1 Ofa_000011
>Ofa_000012-T1 Ofa_000012
>Ofa_000013-T1 Ofa_000013
>Ofa_000014-T1 Ofa_000014
>Ofa_000015-T1 Ofa_000015
>Ofa_000016-T1 Ofa_000016
>Ofa_000017-T1 Ofa_000017
>Ofa_000018-T1 Ofa_000018

and a head of the .gff3 from funannotate update

##gff-version 3
ofavscaf_1  funannotate gene    22362   22434   .   +   .   ID=Ofa_000001;
ofavscaf_1  funannotate tRNA    22362   22434   .   +   .   ID=Ofa_000001-T1;Parent=Ofa_000001;product=tRNA-Gln;
ofavscaf_1  funannotate exon    22362   22434   .   +   .   ID=Ofa_000001-T1.exon1;Parent=Ofa_000001-T1;
ofavscaf_1  funannotate gene    250359  269002  .   +   .   ID=Ofa_000002;
ofavscaf_1  funannotate mRNA    250359  269002  .   +   .   ID=Ofa_000002-T1;Parent=Ofa_000002;product=hypothetical protein;
ofavscaf_1  funannotate five_prime_UTR  250359  250392  .   +   .   ID=Ofa_000002-T1.utr5p1;Parent=Ofa_000002-T1;
ofavscaf_1  funannotate exon    250359  250596  .   +   .   ID=Ofa_000002-T1.exon1;Parent=Ofa_000002-T1;
ofavscaf_1  funannotate exon    254107  254223  .   +   .   ID=Ofa_000002-T1.exon2;Parent=Ofa_000002-T1;
ofavscaf_1  funannotate exon    254664  254843  .   +   .   ID=Ofa_000002-T1.exon3;Parent=Ofa_000002-T1;

What command did you issue?

funannotate annotate \
-i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict \
--cpus 10 \
--iprscan /scratch/projects/omics/ofav_genome/funannotate/interproscan_res/Orbicella_faveolata_gen_17.proteins.fa.xml \
--species "Orbicella faveolata" \
--strain gen_17 \
--out /scratch/projects/omics/ofav_genome/funannotate/step_7_annotate \
--rename QW917

Logfiles To be a bit more concise I have not included the whole log file here, just the few lines above where the error occurs in my LSF .err file

[Jul 23 05:05 PM]: 514 annotations added
[Jul 23 05:05 PM]: Existing BUSCO2 results found: /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/annotations.busco.txt
[Jul 23 05:05 PM]: 1,049 annotations added
[Jul 23 05:05 PM]: Skipping phobius predictions, try funannotate remote -m phobius
[Jul 23 05:05 PM]: Skipping secretome: neither SignalP nor Phobius searches were run
[Jul 23 05:05 PM]: 0 secretome and 0 transmembane annotations added
[Jul 23 05:05 PM]: Parsing InterProScan5 XML file
Traceback (most recent call last):
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/annotate.py", line 1458, in main
    GeneNames = lib.getGeneBasename(Proteins)
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/library.py", line 1081, in getGeneBasename
    transcript, gene = line.split(" ")
ValueError: too many values to unpack (expected 2)

and here is the entirety of the log file from the annotate command. I notice I was stupid and need to reset the busco, will do that in a later run once this error is fixed.

[07/23/23 08:49:42]: /nethome/bdy8/mambaforge/envs/funannotate_env/bin/funannotate annotate -i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict --cpus 10 --iprscan /scratch/projects/omics/ofav_genome/funannotate/interproscan_res/Orbicella_faveolata_gen_17.proteins.fa.xml --species Orbicella faveolata --strain gen_17 --out /scratch/projects/omics/ofav_genome/funannotate/step_7_annotate --rename QW917

[07/23/23 08:49:42]: OS: CentOS Linux 7, 16 cores, ~ 264 GB RAM. Python: 3.8.15
[07/23/23 08:49:42]: Running 1.8.15
[07/23/23 08:49:43]: hmmscan version=HMMER 3.3.2 (Nov 2020) path=/nethome/bdy8/mambaforge/envs/funannotate_env/bin/hmmscan
[07/23/23 08:49:43]: hmmsearch version=HMMER 3.3.2 (Nov 2020) path=/nethome/bdy8/mambaforge/envs/funannotate_env/bin/hmmsearch
[07/23/23 08:49:43]: diamond version=2.1.7 path=/nethome/bdy8/mambaforge/envs/funannotate_env/bin/diamond
[07/23/23 08:49:43]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[07/23/23 08:49:43]: Parsing input files
[07/23/23 08:49:43]: Existing tbl found: /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/update_results/Orbicella_faveolata_gen_17.tbl
[07/23/23 08:50:20]: TBL file: /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.tbl
[07/23/23 08:50:20]: GFF3 file: /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/update_results/Orbicella_faveolata_gen_17.gff3
[07/23/23 08:50:20]: Proteins file: /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta
[07/23/23 08:52:21]: Adding Functional Annotation to Orbicella faveolata, NCBI accession: None
[07/23/23 08:52:21]: Annotation consists of: 35,821 gene models
[07/23/23 08:52:21]: 32,172 protein records loaded
[07/23/23 08:52:22]: Running HMMer search of PFAM version 35.0
[07/23/23 09:07:58]: 39,649 annotations added
[07/23/23 09:07:58]: Running Diamond blastp search of UniProt DB version 2023_02
[07/23/23 09:07:58]: diamond blastp --sensitive --query /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta --threads 10 --out /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/uniprot.xml --db /scratch/projects/omics/ofav_genome/funannotate_db/uniprot.dmnd --evalue 1e-05 --max-target-seqs 1 --outfmt 5
[07/23/23 09:09:16]: 1,409 valid gene/product annotations from 1,996 total
[07/23/23 09:09:17]: Running Eggnog-mapper
[07/23/23 09:09:18]: emapper.py -m diamond -i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta -o eggnog --cpu 10 --scratch_dir /tmp/emapper-3db02fb5 --temp_dir /tmp --dbmem
[07/23/23 09:55:55]: #  emapper-2.1.11
# emapper.py  -m diamond -i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta -o eggnog --cpu 10 --scratch_dir /tmp/emapper-3db02fb5 --temp_dir /tmp --dbmem
  /nethome/bdy8/mambaforge/envs/funannotate_env/bin/diamond blastp -d '/scratch/projects/omics/ofav_genome/eggnog_db/eggnog_proteins.dmnd' -q '/scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta' --threads 10 -o '/tmp/emapper-3db02fb5/eggnog.emapper.hits' --tmpdir '/tmp/emappertmp_dmdn_xbd2svil' --sensitive --iterate -e 0.001 --top 3  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp
Loading source DB...
Warning: this can take a few minutes and load up to 45GB to RAM. Using --dbmem is recommended to annotate a large number of sequences.
 Copying result file /tmp/emapper-3db02fb5/eggnog.emapper.hits from scratch to /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc
 Copying result file /tmp/emapper-3db02fb5/eggnog.emapper.seed_orthologs from scratch to /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc
 Copying result file /tmp/emapper-3db02fb5/eggnog.emapper.annotations from scratch to /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc
Data in /tmp/emapper-3db02fb5 will be not removed. Please, clear it manually.
Done
Result files:
   /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/eggnog.emapper.hits
   /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/eggnog.emapper.seed_orthologs
   /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/eggnog.emapper.annotations
Total hits processed: 24419
Total time: 2797 secs
FINISHED

[07/23/23 09:55:55]: Functional annotation of hits...
Time to load the DB into memory: 98.14618587493896
500 102.76057195663452 4.87 q/s (% mem usage: 20.80, % mem avail: 79.22)
1000 105.52455973625183 9.48 q/s (% mem usage: 20.80, % mem avail: 79.22)
1500 108.07541990280151 13.88 q/s (% mem usage: 20.80, % mem avail: 79.22)
2000 110.64597415924072 18.08 q/s (% mem usage: 20.80, % mem avail: 79.22)
2500 113.56916570663452 22.01 q/s (% mem usage: 20.80, % mem avail: 79.22)
3000 115.83877038955688 25.90 q/s (% mem usage: 20.80, % mem avail: 79.22)
3500 118.33911061286926 29.58 q/s (% mem usage: 20.80, % mem avail: 79.21)
4000 121.04290747642517 33.05 q/s (% mem usage: 20.80, % mem avail: 79.21)
4500 123.74869704246521 36.36 q/s (% mem usage: 20.80, % mem avail: 79.21)
5000 125.81996965408325 39.74 q/s (% mem usage: 20.80, % mem avail: 79.21)
5500 128.26136589050293 42.88 q/s (% mem usage: 20.80, % mem avail: 79.21)
6000 130.87990927696228 45.84 q/s (% mem usage: 20.80, % mem avail: 79.21)
6500 133.0001757144928 48.87 q/s (% mem usage: 20.80, % mem avail: 79.21)
7000 135.80155849456787 51.55 q/s (% mem usage: 20.80, % mem avail: 79.21)
7500 138.11626362800598 54.30 q/s (% mem usage: 20.80, % mem avail: 79.21)
8000 139.9871528148651 57.15 q/s (% mem usage: 20.80, % mem avail: 79.21)
8500 141.91094136238098 59.90 q/s (% mem usage: 20.80, % mem avail: 79.21)
9000 144.62061476707458 62.23 q/s (% mem usage: 20.80, % mem avail: 79.21)
9500 147.17864990234375 64.55 q/s (% mem usage: 20.80, % mem avail: 79.21)
10000 149.85912084579468 66.73 q/s (% mem usage: 20.80, % mem avail: 79.21)
10500 152.38914847373962 68.90 q/s (% mem usage: 20.80, % mem avail: 79.21)
11000 154.8997724056244 71.01 q/s (% mem usage: 20.80, % mem avail: 79.21)
11500 157.74931120872498 72.90 q/s (% mem usage: 20.80, % mem avail: 79.21)
12000 160.33052897453308 74.85 q/s (% mem usage: 20.80, % mem avail: 79.21)
12500 163.33509373664856 76.53 q/s (% mem usage: 20.80, % mem avail: 79.21)
13000 166.48512387275696 78.09 q/s (% mem usage: 20.80, % mem avail: 79.21)
13500 169.40334510803223 79.69 q/s (% mem usage: 20.80, % mem avail: 79.21)
14000 172.24742674827576 81.28 q/s (% mem usage: 20.80, % mem avail: 79.21)
14500 175.25630068778992 82.74 q/s (% mem usage: 20.80, % mem avail: 79.21)
15000 177.9720311164856 84.28 q/s (% mem usage: 20.80, % mem avail: 79.21)
15500 180.87293982505798 85.70 q/s (% mem usage: 20.80, % mem avail: 79.21)
16000 183.9645323753357 86.97 q/s (% mem usage: 20.80, % mem avail: 79.21)
16500 186.56238341331482 88.44 q/s (% mem usage: 20.80, % mem avail: 79.21)
17000 189.0217046737671 89.94 q/s (% mem usage: 20.80, % mem avail: 79.21)
17500 191.85257124900818 91.22 q/s (% mem usage: 20.80, % mem avail: 79.21)
18000 194.47047901153564 92.56 q/s (% mem usage: 20.80, % mem avail: 79.21)
18500 197.2026743888855 93.81 q/s (% mem usage: 20.80, % mem avail: 79.21)
19000 200.09178280830383 94.96 q/s (% mem usage: 20.80, % mem avail: 79.21)
19500 202.64678502082825 96.23 q/s (% mem usage: 20.80, % mem avail: 79.21)
20000 205.84024310112 97.16 q/s (% mem usage: 20.80, % mem avail: 79.21)
20500 208.7857689857483 98.19 q/s (% mem usage: 20.80, % mem avail: 79.21)
21000 210.98079323768616 99.54 q/s (% mem usage: 20.80, % mem avail: 79.21)
21500 213.35583424568176 100.77 q/s (% mem usage: 20.80, % mem avail: 79.21)
22000 216.3390507698059 101.69 q/s (% mem usage: 20.80, % mem avail: 79.21)
22500 218.85862636566162 102.81 q/s (% mem usage: 20.80, % mem avail: 79.21)
23000 221.23084926605225 103.96 q/s (% mem usage: 20.80, % mem avail: 79.21)
23500 222.8097629547119 105.47 q/s (% mem usage: 20.80, % mem avail: 79.21)
24000 224.27410769462585 107.01 q/s (% mem usage: 20.80, % mem avail: 79.21)
24419 249.12369871139526 98.02 q/s (% mem usage: 4.40, % mem avail: 95.63)

[07/23/23 09:55:55]: Parsing EggNog Annotations
[07/23/23 09:55:55]: EggNog version parsed as 2.1.11
[07/23/23 09:55:55]: EggNog annotation detected as emapper v2.1.11 and DB prefix ENOG50
[07/23/23 09:55:57]: 52,099  COG and EggNog annotations added
[07/23/23 09:55:57]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.91
[07/23/23 09:55:59]: 10,109 gene name and product description annotations added
[07/23/23 09:55:59]: Running Diamond blastp search of MEROPS version 12.0
[07/23/23 09:55:59]: diamond blastp --sensitive --query /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta --threads 10 --out /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/merops.xml --db /scratch/projects/omics/ofav_genome/funannotate_db/merops.dmnd --evalue 1e-05 --max-target-seqs 1 --outfmt 5
[07/23/23 09:56:04]: 1,151 annotations added
[07/23/23 09:56:04]: Annotating CAZYmes using HMMer search of dbCAN version 11.0
[07/23/23 10:02:23]: 514 annotations added
[07/23/23 10:02:23]: Annotating proteins with BUSCO dikarya models
[07/23/23 10:02:23]: /nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta -m proteins -l /scratch/projects/omics/ofav_genome/funannotate_db/dikarya -o busco -c 10 -f
[07/23/23 10:04:20]: 
INFO    ****************** Start a BUSCO 2.0 analysis, current time: 07/23/2023 10:02:23 ******************
INFO    The lineage dataset is: dikarya_odb9 (eukaryota)
INFO    Mode is: proteins
INFO    To reproduce this run: python /nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/genome.proteins.fasta -o busco -l /scratch/projects/omics/ofav_genome/funannotate_db/dikarya/ -m proteins -c 10 -sp aspergillus_nidulans
INFO    Check dependencies...
INFO    Check input file...
INFO    Temp directory is ./tmp/
INFO    Running HMMER on the proteins:
INFO    07/23/2023 10:02:24 =>  0% of predictions performed (1312 to be done)
INFO    07/23/2023 10:02:51 =>  10% of predictions performed (133/1312 candidate proteins)
INFO    07/23/2023 10:03:03 =>  20% of predictions performed (263/1312 candidate proteins)
INFO    07/23/2023 10:03:16 =>  30% of predictions performed (394/1312 candidate proteins)
INFO    07/23/2023 10:03:26 =>  40% of predictions performed (526/1312 candidate proteins)
INFO    07/23/2023 10:03:35 =>  50% of predictions performed (657/1312 candidate proteins)
INFO    07/23/2023 10:03:43 =>  60% of predictions performed (788/1312 candidate proteins)
INFO    07/23/2023 10:03:48 =>  70% of predictions performed (919/1312 candidate proteins)
INFO    07/23/2023 10:03:55 =>  80% of predictions performed (1050/1312 candidate proteins)
INFO    07/23/2023 10:04:01 =>  90% of predictions performed (1181/1312 candidate proteins)
INFO    07/23/2023 10:04:18 =>  100% of predictions performed
INFO    Results:
INFO    C:67.3%[S:58.9%,D:8.4%],F:12.1%,M:20.6%,n:1312
INFO    883 Complete BUSCOs (C)
INFO    773 Complete and single-copy BUSCOs (S)
INFO    110 Complete and duplicated BUSCOs (D)
INFO    159 Fragmented BUSCOs (F)
INFO    270 Missing BUSCOs (M)
INFO    1312 Total BUSCO groups searched

INFO    BUSCO analysis done. Total running time: 116.86803841590881 seconds
INFO    Results written in /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/run_busco/

[07/23/23 10:04:20]: 1,049 annotations added
[07/23/23 10:04:20]: Skipping phobius predictions, try funannotate remote -m phobius
[07/23/23 10:04:20]: Skipping secretome: neither SignalP nor Phobius searches were run
[07/23/23 10:04:20]: 0 secretome and 0 transmembane annotations added
[07/23/23 10:04:21]: Parsing InterProScan5 XML file
[07/23/23 10:04:21]: /nethome/bdy8/mambaforge/envs/funannotate_env/bin/python /nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/aux_scripts/iprscan2annotations.py /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/iprscan.xml /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/annotations.iprscan.txt

OS/Install Information

You are running Perl v b'5.032001'. Now checking perl modules... Carp: 1.50 Clone: 0.46 DBD::SQLite: 1.72 DBD::mysql: 4.046 DBI: 1.643 DB_File: 1.858 Data::Dumper: 2.183 File::Basename: 2.85 File::Which: 1.24 Getopt::Long: 2.54 Hash::Merge: 0.302 JSON: 4.10 LWP::UserAgent: 6.67 Logger::Simple: 2.0 POSIX: 1.94 Parallel::ForkManager: 2.02 Pod::Usage: 1.69 Scalar::Util::Numeric: 0.40 Storable: 3.15 Text::Soundex: 3.05 Thread::Queue: 3.14 Tie::File: 1.06 URI::Escape: 5.17 YAML: 1.30 local::lib: 2.000029 threads: 2.25 threads::shared: 1.61 All 27 Perl modules installed

Checking Environmental Variables... $FUNANNOTATE_DB=/scratch/projects/omics/ofav_genome/funannotate_db $PASAHOME=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/pasa-2.5.2 $TRINITY_HOME=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/trinity-2.8.5 $EVM_HOME=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/evidencemodeler-1.1.1 $AUGUSTUS_CONFIG_PATH=/nethome/bdy8/mambaforge/envs/funannotate_env/config/ $GENEMARK_PATH=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/gmes_linux_64 All 6 environmental variables are set

Checking external dependencies... PASA: 2.5.2 CodingQuarry: 2.0 Trinity: 2.8.5 augustus: 3.5.0 bamtools: bamtools 2.5.1 bedtools: bedtools v2.31.0 blat: BLAT v37x1 diamond: 2.1.7 emapper.py: 2.1.11 ete3: 3.1.2 exonerate: exonerate 2.4.0 fasta: 36.3.8g glimmerhmm: 3.0.4 gmap: 2023-04-28 hisat2: 2.2.1 hmmscan: HMMER 3.3.2 (Nov 2020) hmmsearch: HMMER 3.3.2 (Nov 2020) java: 17.0.3-internal kallisto: 0.46.1 mafft: v7.520 (2023/Mar/22) makeblastdb: makeblastdb 2.14.0+ minimap2: 2.26-r1175 pigz: 2.6 proteinortho: 6.2.3 pslCDnaFilter: no way to determine salmon: salmon 0.14.1 samtools: samtools 1.16.1 snap: 2006-07-28 stringtie: 2.2.1 tRNAscan-SE: 2.0.6 (May 2020) tantan: tantan 40 tbl2asn: 25.8 tblastn: tblastn 2.14.0+ trimal: trimAl v1.4.rev15 build[2013-12-17] trimmomatic: 0.39 ERROR: gmes_petap.pl not installed ERROR: signalp not installed

benyoung93 commented 11 months ago

Been doing a little more sleuthing through the log files. The funannotate-annotate.bf6cb020.log shows the command that is being run on the XML:

/nethome/bdy8/mambaforge/envs/funannotate_env/bin/python /nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/aux_scripts/iprscan2annotations.py /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/iprscan.xml /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/annotations.iprscan.txt

Interestingly, when I looked at annotations.iprscan.txt, the file this step should write its output to, it is not empty. A head of it shows:

Ofa_008232-T1   db_xref InterPro:IPR003968
Ofa_008232-T1   db_xref InterPro:IPR003972
Ofa_008232-T1   db_xref InterPro:IPR000210
Ofa_008232-T1   db_xref InterPro:IPR005821
Ofa_008232-T1   db_xref InterPro:IPR027359
Ofa_008232-T1   db_xref InterPro:IPR003131
Ofa_008232-T1   db_xref InterPro:IPR011333
Ofa_008232-T1   db_xref InterPro:IPR028325
Ofa_008232-T1   go_process  potassium ion transport|0006813||IEA
Ofa_008232-T1   go_function channel activity|0015267||IEA

and a wc -l of the file shows 535494 lines.

So I am even more stumped now as to why it is not completing past this step. Maybe it is the last line in the XML file ?? Here is a tail -20 of the xml file just in case.

          </panther-location>
        </locations>
      </panther-match>
      <superfamilyhmmer3-match evalue="5.33E-15">
        <signature ac="SSF50978" name="WD40 repeat-like">
          <entry ac="IPR036322" desc="WD40-repeat-containing domain superfamily" name="WD40_repeat_dom_sf" type="HOMOLOGOUS_SUPERFAMILY"/>
          <signature-library-release library="SUPERFAMILY" version="1.75"/>
        </signature>
        <model-ac>0049784</model-ac>
        <locations>
          <superfamilyhmmer3-location hmm-length="340" start="12" end="167">
            <location-fragments>
              <superfamilyhmmer3-location-fragment start="12" end="167" dc-status="CONTINUOUS"/>
            </location-fragments>
          </superfamilyhmmer3-location>
        </locations>
      </superfamilyhmmer3-match>
    </matches>
  </protein>
</protein-matches>
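
One way to rule the XML in or out is to stream-parse it and count records: a truncated or malformed file raises a parse error, while a complete one yields counts that can be compared with the number of protein records. The following is a minimal sketch, assuming only the namespace shown in the root element of the head output above; `count_proteins` is a hypothetical helper, not part of funannotate or InterProScan.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the <protein-matches> root element shown in the head output.
IPR_NS = "{http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5}"


def count_proteins(xml_source):
    """Stream through an InterProScan XML file and count <protein> and <xref> elements.

    xml_source may be a path or a binary file object. A truncated or malformed
    file raises xml.etree.ElementTree.ParseError.
    """
    n_proteins = 0
    n_xrefs = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == IPR_NS + "xref":
            n_xrefs += 1
        elif elem.tag == IPR_NS + "protein":
            n_proteins += 1
            elem.clear()  # drop the finished record to keep memory use low
    return n_proteins, n_xrefs
```

Calling `count_proteins("Orbicella_faveolata_gen_17.proteins.fa.xml")` without an exception would suggest the file itself is well-formed, which points the search elsewhere.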

benyoung93 commented 11 months ago

Okay so from even more sleuthing, it seems that the actual problem could be the combining of everything. Running the interproscan parsing script seems to go okay even when run independently (i.e. no errors thrown on terminal).

Additionally, when removing the InterProScan results completely, the same problem occurs as stated above (the ValueError one).

So ye still very stumped, any and all help would be extremely appreciated :).

Just some additional notes on my pipeline

benyoung93 commented 11 months ago

Been doing some more checking of my files for all of this. Could it be the input file I used for interproscan?

I used the proteins.fasta from update in InterProScan, but doing some greps I notice that some of the sequences in the mRNA fasta are not in the proteins.fasta. Thought I would report this as it may be useful information.
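
The grep comparison described above can be sketched as a set difference over FASTA record IDs. This is an illustration only: `fasta_ids` and `missing_from` are hypothetical helpers, and the ID is assumed to be the first whitespace-delimited token of each header. Some differences may be expected, since non-coding models such as the tRNA `Ofa_000001-T1` in the GFF3 head have no protein.

```python
def fasta_ids(path):
    """Return the set of record IDs (first whitespace-delimited token of each header)."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def missing_from(reference_path, query_path):
    """Return IDs present in reference_path but absent from query_path, sorted."""
    return sorted(fasta_ids(reference_path) - fasta_ids(query_path))
```

For example, `missing_from("mrna.fa", "proteins.fa")` (file names are placeholders) would list the transcripts that have no matching protein record.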

While I wait for help I am running the mrna.fasta from update in interproscan so I can then see if that one works in annotate

benyoung93 commented 11 months ago

Okay update number 3 (should I be combining all of these into the top post??).

I ran an awk and found that there is a miscreant locus tag

awk '/^>[^[:space:]]+[[:space:]][^[:space:]]+[[:space:]][^[:space:]]+/'
>Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2-T1 Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2

Doing a grep for "novel", I actually have 2 gene names that are not labelled properly. That said, it is probably (?) the one with the whitespace that is causing the problem.

>Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2-T1 Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2
>Ofa_030816_novel_gene_620_64b6f8d2-T1 Ofa_030816_novel_gene_620_64b6f8d2
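
The same check as the awk one-liner can be sketched in Python, flagging every header that does not split into exactly two whitespace-separated fields (transcript ID and gene ID). `malformed_headers` is a hypothetical helper written for this thread; the two-field expectation is taken from the `transcript, gene = line.split(" ")` line in the traceback.

```python
def malformed_headers(path, expected_fields=2):
    """Return (line_number, header) pairs for FASTA headers with an unexpected field count."""
    bad = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.startswith(">"):
                continue
            header = line.rstrip("\n")
            # Drop the leading '>' and split on any run of whitespace.
            if len(header[1:].split()) != expected_fields:
                bad.append((lineno, header))
    return bad
```

Note that this only catches IDs containing whitespace. An ID such as `Ofa_030816_novel_gene_620_64b6f8d2` still splits cleanly into two fields, so it needs a separate grep for "novel".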

So then I wondered where this happened. Going into my predict_results and running grep "novel" file shows nothing aberrant. Even searching for Ofa_031543 and Ofa_030816 shows that these are formatted properly there. BUT, going into the update_results and running grep "novel" file shows that these genes now have novel within them and all the other stuff as well. Hmmmmmm. This is in most of the update files.

Is there any idea why this would be happening? Just a parsing issue at some point in the update command??

I'm also wondering if there is a proper way to fix this rather than going into all the files and fixing it manually, or doing some awks and seds (although that does instill a silent terror in my bones). I've looked at fix but am a little confused about the usage.

Thank you for any and all advice on this :).

nextgenusfs commented 11 months ago

Hi @benyoung93. Sorry you ran into this problem. And you've done a great sleuthing job already. Indeed the error appears to be in parsing the gene names, and for some reason it seems that a handful of gene models from PASA update (funannotate update) were not processed/renamed properly, which is then causing a problem when trying to parse their names.

The simplest fix would be to just delete those two problematic gene models (you could do this with funannotate fix). If you however think they are real then you probably don't want to do that. But based on the names that PASA assigned (they are kind of crazy), it appears to be novel genes that it thinks exist based on the transcriptome data. So ideally you'd want the names to follow a numeric progression (ie 00001 --> 00002 --> 000003, etc) as you move along the chromosomes/scaffolds/contigs. However, this isn't necessary at all. So you could also just rename those two gene models with new unique numbers, ie add 1 to the largest locus tag number you have.
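
The "add 1 to the largest locus tag number" suggestion can be sketched as follows. This is an illustration under assumptions: `next_locus_tag` is a hypothetical helper, the `Ofa` prefix and six-digit width are inferred from the IDs shown earlier in this thread, and IDs are read from the `ID=` attribute of each GFF3 line.

```python
import re


def next_locus_tag(gff3_lines, prefix="Ofa", width=6):
    """Return the next unused locus tag, one above the largest number seen in ID= attributes."""
    pattern = re.compile(r"ID=" + re.escape(prefix) + r"_(\d+)")
    largest = -1
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # skip GFF3 directives and comments
        match = pattern.search(line)
        if match:
            largest = max(largest, int(match.group(1)))
    if largest < 0:
        raise ValueError(f"no IDs with prefix {prefix!r} found")
    return f"{prefix}_{largest + 1:0{width}d}"
```

Passing an open file handle for the GFF3 in update_results as `gff3_lines` would give a tag that does not collide with any existing gene model.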

And you bring up a challenging point about how to do this, which files, etc. This would appear somewhat tricky, but the good news is that the final annotation files will be the output of funannotate annotate so you just need to fix the files in the update_results folder that annotate will use. So the files to "fix" here are the Genbank, GFF3, and TBL files in the update_results folder -- you just need to fix the IDs for those two gene models. Those three files will get used in annotate to generate the necessary files for adding functional annotation.

It would be nice to figure out why those models from update were not renamed so we could fix the bug -- that would be the other "fix" is if we can get the code updated and you could re-run update it would hopefully rename the models properly. Would you be able to share the update_misc/bestmodels.gff3? I maybe don't need the entire file, but certainly need to understand what the naming of these problematic genes look like in that file. You can email it to me if that is easier. I think the error must be in this region and likely just a format that I've never seen before: https://github.com/nextgenusfs/funannotate/blob/master/funannotate/update.py#L1748-L1761

benyoung93 commented 11 months ago

Good morning @nextgenusfs :).

Thank you for the response :). To answer some queries

> So ideally you'd want the names to follow a numeric progression (ie 00001 --> 00002 --> 000003, etc) as you move along the chromosomes/scaffolds/contigs.

This is interesting because these weird names are actually in the correct place, I think (i.e. Ofa_030816_novel_gene_620_64b6f8d2-T1 Ofa_030816_novel_gene_620_64b6f8d2 is next to Ofa_030815 and Ofa_030817 and on the correct contig, ofavscaf_14). These Ofa_XXXXX numbers are not repeated anywhere else, so I just deleted the craziness and kept them as Ofa_XXXXX, making sure to be consistent and to add the relevant -T1 where needed.
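
The manual edit described above (keep the `Ofa_XXXXX` part, drop the `_novel_gene_...` residue, leave any `-T1` suffix in place) could be sketched with a regular expression. This is a hedged illustration, not what funannotate does: `clean_ids` is a hypothetical helper, and the pattern assumes the residue always looks like the two examples in this thread, i.e. `_novel_gene_`, an optional stray space, a number, an underscore and a lowercase hex string.

```python
import re

# Assumed shape of the leftover PASA naming, based only on the two IDs in this thread.
_RESIDUE = re.compile(r"(Ofa_\d+)(?:_novel_gene_ ?\d+_[0-9a-f]+)+")


def clean_ids(line):
    """Strip '_novel_gene_...' residue that follows an Ofa_ locus tag; other text is untouched."""
    return _RESIDUE.sub(r"\1", line)
```

Applied line by line to copies of the GFF3 and TBL files this would leave IDs that start with `novel_gene_` alone, because the pattern requires a leading `Ofa_` tag. A GenBank file may wrap long values across lines, so it is worth checking that one by eye, and keeping the originals as a backup either way.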

> So the files to "fix" here are the Genbank, GFF3, and TBL files in the update_results folder -- you just need to fix the IDs for those two gene models. Those three files will get used in annotate to generate the necessary files for adding functional annotation.

Yep, this is exactly what I did. I downloaded the files, fixed them in Atom, and then re-uploaded them (while also tarring and gzipping the original files to keep a record of this). InterProScan has just finished running and I will be setting off annotate momentarily to see if this fix worked.

> It would be nice to figure out why those models from update were not renamed so we could fix the bug -- that would be the other "fix" is if we can get the code updated and you could re-run update it would hopefully rename the models properly. Would you be able to share the update_misc/bestmodels.gff3

I would be more than happy to send this to you :). One query I have here: I was not able to get mysql onto our HPC cluster (I tried so many things to get it installed, proper channels as well as hacky ways lol), so I had to get the walltime extended for the update command. It ran in 7 days. I think that I can re-run the update command and, as I have all the files, it will just re-parse and combine everything (?) so I do not have to do that 7 day wait again. Please let me know if that is right/wrong.

I have sent the files, and I think you may be right. Looking at the bestmodel.gff, those two genes are the only ones with the Ofa_xxx and then all the novel after them (ID=Ofa_030816_novel_gene_620_64b6f8d2). All other instances of novel (from my quick skim through) do not have the Ofa_xxx; instead the gene name starts with novel (e.g. ID=novel_gene_1388_64b6f8d2_novel_gene_1442_64b6f8d2). So I think the section of code you identified is the right one :).

Thank you for all the help, I will update here if fixing those names allows annotate to complete successfully.

Ben

benyoung93 commented 11 months ago

Also, a quick query/enhancement request. If someone hits the error I did (copied below), how feasible would it be to have some sort of awk/sed/grep that prints out the offending locus tags/IDs when merging?

It may not be needed, as I have not seen any other issues like this and mine may be unique (?), but it could be a nice addition.

Traceback (most recent call last):
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/annotate.py", line 1458, in main
    GeneNames = lib.getGeneBasename(Proteins)
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/library.py", line 1081, in getGeneBasename
    transcript, gene = line.split(" ")
ValueError: too many values to unpack (expected 2)
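
To illustrate the kind of message being requested, here is a minimal sketch of a header parser that names the offending line instead of failing on the bare two-value unpack. `parse_header` is a hypothetical stand-in written for this thread; it is not the code in funannotate's `library.py`, and only the `line.split(" ")` behaviour is taken from the traceback.

```python
def parse_header(line):
    """Split a 'transcriptID geneID' header, reporting the offending header on failure."""
    fields = line.strip().lstrip(">").split(" ")
    if len(fields) != 2:
        raise ValueError(
            f"expected 'transcriptID geneID' but found {len(fields)} fields "
            f"in header: {line.strip()!r}"
        )
    return fields[0], fields[1]
```

With a check like this, the header containing the stray spaces would appear in the error text, so the problematic gene models could be found without a separate awk pass.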

benyoung93 commented 11 months ago

Can confirm that after fixing those 2 genes from the update step, annotate runs successfully and I have all my files wooooooooooooooooooooo.

Let me know if you need any more information re the naming bug and I can provide it. I will leave this open until we get to the bottom of that :).

Ben