nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

RNA free annotation UnicodeDecodeError #569

Closed Rob-murphys closed 3 years ago

Rob-murphys commented 3 years ago

I am trying to run an annotation without RNA-seq evidence and am running into some issues.

Version funannotate v1.8.3

Input: funannotate predict -i $input -o $outdir --species $species --busco_seed_species $buscoSpecies -d $database --cpus 5

The error


-------------------------------------------------------
[03:00 PM]: OS: CentOS Linux 7, 40 cores, ~ 198 GB RAM. Python: 3.6.10
[03:00 PM]: Running funannotate v1.8.3
[03:00 PM]: Skipping CodingQuarry as no --rna_bam passed
[03:00 PM]: Parsed training data, run ab-initio gene predictors as follows:
Traceback (most recent call last):
  File "/services/tools/funannotate/1.8.3/bin/funannotate", line 713, in <module>
    main()
  File "/services/tools/funannotate/1.8.3/bin/funannotate", line 703, in main
    mod.main(arguments)
  File "/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/predict.py", line 572, in main
    augustus_version, augustus_functional = lib.checkAugustusFunc()
  File "/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/library.py", line 1039, in checkAugustusFunc
    stdout=subprocess.PIPE, universal_newlines=True).communicate()
  File "/services/tools/funannotate/1.8.3/lib/python3.6/subprocess.py", line 850, in communicate
    stdout = self.stdout.read()
  File "/services/tools/funannotate/1.8.3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)
nextgenusfs commented 3 years ago

This is potentially an Augustus-related error, although I'm not sure why it is yielding the Unicode decode error if you are using Python 3. What version of Augustus do you have installed and what OS are you on? What it is checking here is for a functional proteinprofile mode of Augustus; that feature in the Augustus code seems to have various compilation issues on different operating systems. You can test it manually with:

augustus --species=anidulans /path/to/your/env/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl \
    /path/to/your/env/lib/python3.6/site-packages/funannotate/config/busco_test.fa
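The traceback above fails while Python decodes the child process's stdout in text mode (`universal_newlines=True`). A minimal sketch of one way to sidestep that, assuming you are writing your own wrapper script: capture the output as bytes and decode it explicitly. The helper name `run_and_decode` is hypothetical and is not funannotate's code; the child process below is a stand-in for any command that prints non-ASCII text.

```python
import subprocess
import sys


def run_and_decode(cmd, encoding="utf-8"):
    """Run cmd, capture stdout as raw bytes, and decode without raising.

    Hypothetical helper for illustration; not part of funannotate.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, _ = proc.communicate()
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return proc.returncode, stdout.decode(encoding, errors="replace")


# Stand-in child process that writes the UTF-8 bytes for "S. König"
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'S. K\\xc3\\xb6nig\\n')",
]
code, text = run_and_decode(child)
print(code, text.strip())
```

Because the decoding step no longer depends on the locale, the same call behaves identically whether or not the shell exports a UTF-8 locale.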
Rob-murphys commented 3 years ago

Augustus version v3.3.3

OS Linux kernel version 3.10.0-957.1.3.el7.x86_64

Python version Python 3.6.10 :: Anaconda, Inc.

It seems I am missing this test data in my config directory:

ls /services/tools/funannotate/1.8.3/config/
cgp/       extrinsic/ model/     profile/   species/
nextgenusfs commented 3 years ago

Those files are in the funannotate Python package directory, not the Augustus config directory.

Rob-murphys commented 3 years ago

Ah okay, it does not like me providing two query files:

augustus --species=anidulans /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa

augustus: ERROR
        Error: 2 query files given: /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa and /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl.
parameter names must start with '--'

I assume I am missing flags but am not sure which file is what.

nextgenusfs commented 3 years ago

Oh, I'm sorry, I forgot the --proteinprofile= flag:

augustus --species=anidulans --proteinprofile=/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa
Rob-murphys commented 3 years ago

Here is the output:

(base) [robmur@g-12-l0002 scripts]$ augustus --species=anidulans --proteinprofile=/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa
# This output was generated with AUGUSTUS (version 3.4.0).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Sources of extrinsic information: M RM
# Initializing the parameters using config directory /services/tools/augustus/3.4.0/config/ ...
Warning: Block no.unknown_E is not significant enough, removed from profile.
Warning: Block no.unknown_F is not significant enough, removed from profile.
Warning: Block no.unknown_H is not significant enough, removed from profile.
Warning: Block no.unknown_AC is not significant enough, removed from profile.
# Using protein profile unknown
# --[0..117]--> unknown_A (9) <--[2..25]--> unknown_B (27) <--[1..16]--> unknown_C (8) <--[0..1]--> unknown_D (15) <--[18..100]--> unknown_G (19) <--[8..25]--> unknown_I (32) <--[0..1]--> unknown_J (33) <--[1..16]--> unknown_K (38) <--[1..3]--> unknown_L (14) <--[0..5]--> unknown_M (59) <--[0..19]--> unknown_N (23) <--[0..145]--> unknown_O (23) <--[3..18]--> unknown_P (27) <--[1..44]--> unknown_Q (12) <--[10..82]--> unknown_R (13) <--[10..106]--> unknown_S (18) <--[1..11]--> unknown_T (32) <--[2..5]--> unknown_U (12) <--[0..1]--> unknown_V (32) <--[7..18]--> unknown_W (13) <--[3..8]--> unknown_X (87) <--[0..1]--> unknown_Y (12) <--[2..33]--> unknown_Z (40) <--[0..11]--> unknown_AA (16) <--[3..30]--> unknown_AB (19) <--[8..47]--> unknown_AD (23) <--[0..1]--> unknown_AE (13) <--[0..38]--
# anidulans version. Using default transition matrix.
# Looks like /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 3801, name = example) -----
#
# Predicted genes for sequence number 1 on both strands
# start gene g1
example AUGUSTUS        gene    788     3077    0.96    +       .       g1
example AUGUSTUS        transcript      788     3077    0.96    +       .       g1.t1
example AUGUSTUS        start_codon     788     790     .       +       0       transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS        CDS     788     996     1       +       0       transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS        CDS     1049    3077    0.96    +       1       transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS        stop_codon      3075    3077    .       +       0       transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MDISDLIEPPQKRLKTEDISSADEVVLPAGGITPQTDNEIDEQLSKEIEVGITEFVSADNEGFAGILKKRYTDFLVNE
# ILPSGKVLHLTNTTAPNTNDEATPVQADKKPAEDKPKEPETPAEKLPAPVEFQLAEEDEALLDTLFGTQNTKKIVALHKKALANPKTKPSDLGRLNTV
# VVNDRDQRIKMHQAIRRIFNSQIESSTDSEGMMVISVAANRNKKNPQGGGGGRERPRVNWDELGGQYLHFTIYKENKDTMEVISFIARQLKMNPKSFQ
# FAGTKDRRGVTVQRACAYRLQADRLAKLNRTLRNAVVGDFEYQPHGLELGDLYGNEFVVTLRECEVPGINIQDPASAVAKTKELVNTSLKNLYQRGYF
# NYYGLQRFGSFATRTDTVGVKILQDDFKGACDAILDYSPHILAAAQAELGQGEGEGATPTNISSEDKARALAIHIFRTTDRVTDALEKMPRKFSAESN
# IIRHLGRSKNDYLGALQTIPRNLRLMYVHAYQSLVWNLAVGERWRLYGDRVVEGDLVLIHEHRDKDGNSSYTTPAPGAGASGETTTIDADGEIIIVPQ
# EHDSAFAVEDTFTRARALTAAEANSGLYSIFDIVLPLPGFDVLYPPNKMTDFYKEFMGSSRGGGLDPFNMRRKWKDASLSGSYRKVLSRMGRDYSVDV
# VLYSRDEEQFVRTDLENLTLKTRDGGDVDLEKKEGKSEGDKLAVVLKFQLGSSQYATMALRELMRGKVKAYKPDFGGGR]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
# CDS exons: 0/2
# CDS introns: 0/1
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 0
# end gene g1
###
# command line:
# augustus --species=anidulans --proteinprofile=/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa
nextgenusfs commented 3 years ago

Okay, well, it looks like it is working, but Augustus v3.4 currently won't work with funannotate and the BUSCO-mediated generation of training models.

# This output was generated with AUGUSTUS (version 3.4.0).
Rob-murphys commented 3 years ago

Ah, apart from what I just showed you, all runs have been on Augustus v3.3.3.

Here is the output on v3.3.3:


(base) [robmur@g-12-l0002 ~]$ augustus --species=anidulans --proteinprofile=/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa
# This output was generated with AUGUSTUS (version 3.3.3).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initializing the parameters using config directory /services/tools/funannotate/1.8.3/config/ ...
Warning: Block unknown_E is not significant enough, removed from profile.
Warning: Block unknown_F is not significant enough, removed from profile.
Warning: Block unknown_H is not significant enough, removed from profile.
Warning: Block unknown_AC is not significant enough, removed from profile.
# Using protein profile unknown
# --[0..117]--> unknown_A (9) <--[2..25]--> unknown_B (27) <--[1..16]--> unknown_C (8) <--[0..1]--> unknown_D (15) <--[18..100]--> unknown_G (19) <--[8..25]--> unknown_I (32) <--[0..1]--> unknown_J (33) <--[1..16]--> unknown_K (38) <--[1..3]--> unknown_L (14) <--[0..5]--> unknown_M (59) <--[0..19]--> unknown_N (23) <--[0..145]--> unknown_O (23) <--[3..18]--> unknown_P (27) <--[1..44]--> unknown_Q (12) <--[10..82]--> unknown_R (13) <--[10..106]--> unknown_S (18) <--[1..11]--> unknown_T (32) <--[2..5]--> unknown_U (12) <--[0..1]--> unknown_V (32) <--[7..18]--> unknown_W (13) <--[3..8]--> unknown_X (87) <--[0..1]--> unknown_Y (12) <--[2..33]--> unknown_Z (40) <--[0..11]--> unknown_AA (16) <--[3..30]--> unknown_AB (19) <--[8..47]--> unknown_AD (23) <--[0..1]--> unknown_AE (13) <--[0..38]--
# anidulans version. Using default transition matrix.
# Looks like /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 3801, name = example) -----
#
# Predicted genes for sequence number 1 on both strands
# start gene g1
example AUGUSTUS        gene    788     3077    0.88    +       .       g1
example AUGUSTUS        transcript      788     3077    0.88    +       .       g1.t1
example AUGUSTUS        start_codon     788     790     .       +       0       transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS        CDS     788     996     1       +       0       transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS        CDS     1049    3077    0.88    +       1       transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS        stop_codon      3075    3077    .       +       0       transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MDISDLIEPPQKRLKTEDISSADEVVLPAGGITPQTDNEIDEQLSKEIEVGITEFVSADNEGFAGILKKRYTDFLVNE
# ILPSGKVLHLTNTTAPNTNDEATPVQADKKPAEDKPKEPETPAEKLPAPVEFQLAEEDEALLDTLFGTQNTKKIVALHKKALANPKTKPSDLGRLNTV
# VVNDRDQRIKMHQAIRRIFNSQIESSTDSEGMMVISVAANRNKKNPQGGGGGRERPRVNWDELGGQYLHFTIYKENKDTMEVISFIARQLKMNPKSFQ
# FAGTKDRRGVTVQRACAYRLQADRLAKLNRTLRNAVVGDFEYQPHGLELGDLYGNEFVVTLRECEVPGINIQDPASAVAKTKELVNTSLKNLYQRGYF
# NYYGLQRFGSFATRTDTVGVKILQDDFKGACDAILDYSPHILAAAQAELGQGEGEGATPTNISSEDKARALAIHIFRTTDRVTDALEKMPRKFSAESN
# IIRHLGRSKNDYLGALQTIPRNLRLMYVHAYQSLVWNLAVGERWRLYGDRVVEGDLVLIHEHRDKDGNSSYTTPAPGAGASGETTTIDADGEIIIVPQ
# EHDSAFAVEDTFTRARALTAAEANSGLYSIFDIVLPLPGFDVLYPPNKMTDFYKEFMGSSRGGGLDPFNMRRKWKDASLSGSYRKVLSRMGRDYSVDV
# VLYSRDEEQFVRTDLENLTLKTRDGGDVDLEKKEGKSEGDKLAVVLKFQLGSSQYATMALRELMRGKVKAYKPDFGGGR]
# end gene g1
###
# command line:
# augustus --species=anidulans --proteinprofile=/services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/EOG092C0B3U.prfl /services/tools/funannotate/1.8.3/lib/python3.6/site-packages/funannotate/config/busco_test.fa
nextgenusfs commented 3 years ago

The Unicode error really comes down to this one line, "S. König": the umlaut in the name was a problem with Python 2.7. But I can't reproduce the Unicode error locally with any version of Python 3. I'm installing on CentOS at the moment; let me see if I can get similar behavior.
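To illustrate why that one name matters, here is a minimal sketch that decodes the Augustus author line from the output above with two codecs. The helper `try_decode` is hypothetical and exists only for this demonstration. In UTF-8 the "ö" is encoded as the two bytes 0xc3 0xb6, and 0xc3 is the byte named in the traceback.

```python
# The author line from the Augustus banner, encoded the way a UTF-8 terminal sees it
AUGUSTUS_BANNER = (
    "# O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.\n"
).encode("utf-8")


def try_decode(raw, encoding):
    """Return (ok, text_or_reason) instead of raising on bad bytes.

    Hypothetical helper for illustration only.
    """
    try:
        return True, raw.decode(encoding)
    except UnicodeDecodeError as exc:
        detail = "byte 0x{:02x} at offset {}: {}".format(
            raw[exc.start], exc.start, exc.reason
        )
        return False, detail


print(try_decode(AUGUSTUS_BANNER, "ascii"))  # fails on the first byte of "ö"
print(try_decode(AUGUSTUS_BANNER, "utf-8"))  # decodes cleanly
```

So the error depends on which codec Python picks for the pipe, not on Augustus itself producing bad output.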

Rob-murphys commented 3 years ago

Awesome thanks for helping me out :)

nextgenusfs commented 3 years ago

I'm unable to reproduce this on CentOS 7 with Python 3.7.8 and Augustus 3.3.3:

$ lsb_release -d
Description:    CentOS Linux release 7.9.2009 (Core)

Created a conda environment with mamba (faster solver than conda):

mamba create -n funannotate funannotate
$ conda activate funannotate
$ which python
/apps/miniconda3/envs/funannotate/bin/python

$ python
Python 3.7.8 | packaged by conda-forge | (default, Nov 17 2020, 23:42:15)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import funannotate.library as lib
>>> lib.checkAugustusFunc()
('AUGUSTUS (3.3.3)', True)
>>>
Rob-murphys commented 3 years ago

Could the different python version be the issue? v3.6.10 vs v3.7.8

nextgenusfs commented 3 years ago

I'm not seeing that either, this was a quick way to check:

mamba create -n py3610 "python==3.6.10" "augustus==3.3.3"

$ conda activate py3610
$ python -m pip install funannotate
$ which python
/apps/miniconda3/envs/py3610/bin/python
(py3610) $ python
Python 3.6.10 | packaged by conda-forge | (default, Apr 24 2020, 16:42:08)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import funannotate.library as lib
>>> lib.checkAugustusFunc()
('AUGUSTUS (3.3.3)', True)
>>>
nextgenusfs commented 3 years ago

How are you loading your environment? Could be something in the order/method that you are activating the environment.

Rob-murphys commented 3 years ago

The cluster uses Environment Modules. I am loading the following modules when using funannotate:

module load tools perl genemark-es/4.62 signalp/4.1c funannotate/1.8.3

I don't bother loading Augustus as we point funannotate directly to it on the command line.

nextgenusfs commented 3 years ago

I'm afraid I'm not going to be much help with that setup as I haven't used it -- @hyphaltip any experience with this setup?

Rob-murphys commented 3 years ago

As far as I understand it, it mostly just sets environment variables when you load a specific module. I will share as detailed a report of my environment as I can when I have access to the cluster again tomorrow.
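Since the module system mainly changes environment variables, one thing worth including in such a report is the locale. This is a possible cause, not something confirmed in this thread: in Python 3.6, a pipe opened with `universal_newlines=True` and no explicit encoding is decoded with `locale.getpreferredencoding(False)`, which is an ASCII codec (often reported as `ANSI_X3.4-1968`) when no UTF-8 locale is set. A minimal diagnostic sketch, with hypothetical function names, to run inside the same environment that launches funannotate:

```python
import locale
import os
import sys


def text_pipe_encoding():
    """Encoding Python uses by default for text-mode pipes and files."""
    return locale.getpreferredencoding(False)


def locale_report():
    """Collect the settings that influence default text decoding."""
    return {
        "python": sys.version.split()[0],
        "preferred_encoding": text_pipe_encoding(),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
    }


for key, value in locale_report().items():
    print("{}: {}".format(key, value))
```

If `preferred_encoding` is not UTF-8, exporting a UTF-8 locale (for example `LC_ALL=en_US.UTF-8`, assuming that locale is installed on the cluster) before running funannotate would be a reasonable thing to try.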

hyphaltip commented 3 years ago

The order is: I usually load the modules and then load the conda env in our module system. This is our module, where we do the module loads and then the conda env load: https://github.com/ucr-hpcc/hpcc_modules/blob/master/funannotate/1.8.4

Rob-murphys commented 3 years ago

I am just going to try with your conda distribution and see if that solves the issues.

Rob-murphys commented 3 years ago

How do we stop funannotate saving what I assume are temporary files to $HOME?

E.g.: p2g_1ae9bfde-5527-4b05-8fc2-ce0c723546d3

nextgenusfs commented 3 years ago

It assumes you have read/write privileges in the directory in which you launched the script. That is a temp folder for processing the protein2genome alignments.
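A minimal sketch of how to check which directory that is, assuming the temp folder is created from a relative name such as `p2g_<uuid>` (an assumption based on the folder name above, not taken from the funannotate source). A relative path resolves against the process's current working directory, which under a job scheduler can be `$HOME` rather than the directory holding the script. Both function names are hypothetical.

```python
import os
import uuid


def scratch_dir_name(prefix="p2g_"):
    """Show where a relative temp-folder name would resolve right now.

    Illustrative only; the naming scheme is an assumption.
    """
    return os.path.abspath(prefix + str(uuid.uuid4()))


def cwd_status():
    """Return the current working directory and whether it is writable."""
    cwd = os.getcwd()
    return cwd, os.access(cwd, os.W_OK)


cwd, writable = cwd_status()
print("working directory:", cwd, "(writable)" if writable else "(NOT writable)")
print("a relative temp folder would land at:", scratch_dir_name())
```

Running this at the top of the job script shows whether the job really starts in the directory you expect; an explicit `cd` to a scratch directory before calling funannotate would then control where such folders appear.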

Rob-murphys commented 3 years ago

I believe these are being generated in a directory different from the one I am launching the scripts from.