nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

BUSCO-mediated training failure in 1.8.17, but not in 1.8.15 #1071

Open JWDebler opened 2 weeks ago

JWDebler commented 2 weeks ago

I just installed 1.8.17 on a new system and am going through the test pipeline fixing errors.

I'm not sure what to do about the current one, though.

It happens during BUSCO-mediated training. However, the same step finishes fine on my old machine with 1.8.15 (both logs below).

1.8.17:

Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 32 --species Awesome busco
#########################################################
-------------------------------------------------------
[Oct 07 05:38 AM]: OS: Ubuntu 24.04, 32 cores, ~ 247 GB RAM. Python: 3.9.19
[Oct 07 05:38 AM]: Running funannotate v1.8.17
[Oct 07 05:38 AM]: Skipping CodingQuarry as no --rna_bam passed
[Oct 07 05:38 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[Oct 07 05:38 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Oct 07 05:38 AM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/aux_scripts/funannotate-p2g.py:14: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  from pkg_resources import parse_version
[Oct 07 05:38 AM]: Mapping 1,065 proteins to genome using diamond and exonerate
[Oct 07 05:38 AM]: Found 1,505 preliminary alignments with diamond in 0:00:01 --> generated FASTA files for exonerate in 0:00:00
     Progress: 1505 complete, 0 failed, 0 remaining
[Oct 07 05:38 AM]: Exonerate finished in 0:00:10: found 1,270 alignments
[Oct 07 05:38 AM]: Running GeneMark-ES on assembly
[Oct 07 05:39 AM]: 1,566 predictions from GeneMark
[Oct 07 05:39 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Oct 07 05:42 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:42 AM]: 189 BUSCO predictions validated
[Oct 07 05:42 AM]: Not enough gene models 189 to train Augustus (200 required), exiting
#########################################################
Traceback (most recent call last):
  File "/data/mamba_envs/envs/funannotate/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/funannotate.py", line 717, in main
    mod.main(arguments)
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/test.py", line 407, in main
    runBuscoTest(args)
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/test.py", line 200, in runBuscoTest
    assert 1500 <= countGFFgenes(os.path.join(
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/test.py", line 45, in countGFFgenes
    with open(input, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'test-busco_07886a94-a47b-4261-a14b-f81f46fe307d/annotate/predict_results/Awesome_busco.gff3'

1.8.15:

#########################################################
Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 16 --species Awesome busco
#########################################################
-------------------------------------------------------
[Oct 07 05:26 AM]: OS: Ubuntu 18.04, 16 cores, ~ 66 GB RAM. Python: 3.8.15
[Oct 07 05:26 AM]: Running funannotate v1.8.15
[Oct 07 05:26 AM]: Skipping CodingQuarry as no --rna_bam passed
[Oct 07 05:26 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[Oct 07 05:26 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Oct 07 05:26 AM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[Oct 07 05:26 AM]: Mapping 1,065 proteins to genome using diamond and exonerate
[Oct 07 05:26 AM]: Found 1,505 preliminary alignments with diamond in 0:00:01 --> generated FASTA files for exonerate in 0:00:00
     Progress: 1505 complete, 0 failed, 0 remaining
[Oct 07 05:26 AM]: Exonerate finished in 0:00:10: found 1,270 alignments
[Oct 07 05:26 AM]: Running GeneMark-ES on assembly
[Oct 07 05:27 AM]: 1,558 predictions from GeneMark
[Oct 07 05:27 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Oct 07 05:30 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:31 AM]: 367 BUSCO predictions validated
[Oct 07 05:31 AM]: Training Augustus using BUSCO gene models
[Oct 07 05:31 AM]: Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   99.4%         83.8%
  exons         63.2%         52.6%
  genes         76.7%         51.4%
[Oct 07 05:31 AM]: Running Augustus gene prediction using awesome_busco parameters
     Progress: 11 complete, 0 failed, 0 remaining
[Oct 07 05:31 AM]: 1,284 predictions from Augustus
[Oct 07 05:31 AM]: Pulling out high quality Augustus predictions
[Oct 07 05:31 AM]: Found 306 high quality predictions from Augustus (>90% exon evidence)
[Oct 07 05:31 AM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[Oct 07 05:32 AM]: 1,391 predictions from SNAP
[Oct 07 05:32 AM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[Oct 07 05:32 AM]: 1,775 predictions from GlimmerHMM
[Oct 07 05:32 AM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        978
  Augustus HiQ   2        306
  GeneMark       1        1558
  GlimmerHMM     1        1775
  snap           1        1391
  Total          -        6008

As the logs show, both versions find the same number of BUSCO predictions (370), but 1.8.17 validates only 189 of them, so the run exits because Augustus training requires at least 200. 1.8.15, by contrast, validates 367.
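For anyone comparing their own runs, the two counts can be pulled out of the predict logs with a few lines of Python. This is only a minimal sketch: the regular expressions are written against the log lines pasted above, the helper name `busco_counts` is made up for illustration, and the 200-model threshold is simply the number quoted in the 1.8.17 log message.

```python
import re

# Patterns written against the log lines pasted in this issue (assumption:
# other funannotate versions word these messages the same way).
FOUND_RE = re.compile(r"\]:\s+([\d,]+) valid BUSCO predictions found")
VALIDATED_RE = re.compile(r"\]:\s+([\d,]+) BUSCO predictions validated")

MIN_MODELS = 200  # threshold quoted in the 1.8.17 log message


def busco_counts(log_text):
    """Return (found, validated) BUSCO model counts parsed from a predict log."""
    found = FOUND_RE.search(log_text)
    validated = VALIDATED_RE.search(log_text)
    if not found or not validated:
        raise ValueError("BUSCO count lines not found in log")
    return (
        int(found.group(1).replace(",", "")),
        int(validated.group(1).replace(",", "")),
    )


log_1817 = """\
[Oct 07 05:42 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:42 AM]: 189 BUSCO predictions validated
"""
log_1815 = """\
[Oct 07 05:30 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:31 AM]: 367 BUSCO predictions validated
"""

for label, text in (("1.8.17", log_1817), ("1.8.15", log_1815)):
    found, validated = busco_counts(text)
    status = "enough" if validated >= MIN_MODELS else "too few"
    print(f"{label}: {validated}/{found} validated ({status} to train Augustus)")
```
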

Versions:

-------------------------------------------------------
Checking dependencies for 1.8.17
-------------------------------------------------------
You are running Python v 3.9.19. Now checking python packages...
biopython: 1.79
goatools: 1.4.12
matplotlib: 3.9.2
natsort: 8.4.0
numpy: 1.26.4
pandas: 2.2.3
psutil: 6.0.0
requests: 2.32.3
scikit-learn: 1.5.2
scipy: 1.13.1
seaborn: 0.13.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.050
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.58
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.03
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/data/databases/
$PASAHOME=/data/mamba_envs/envs/funannotate/opt/pasa-2.5.3
$TRINITY_HOME=/data/mamba_envs/envs/funannotate/opt/trinity-2.15.2
$EVM_HOME=/data/mamba_envs/envs/funannotate/opt/evidencemodeler-2.1.0
$AUGUSTUS_CONFIG_PATH=/data/mamba_envs/envs/funannotate/config/
$GENEMARK_PATH=/opt/genemark/current/
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.3
CodingQuarry: 2.0
Trinity: 2.15.2
augustus: 3.5.0
bamtools: bamtools 2.5.2
bedtools: bedtools v2.31.1
blat: BLAT v39x1
diamond: 2.1.8
emapper.py: 2.1.12
ete3: 3.1.3
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2024-09-18
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.4 (Aug 2023)
hmmsearch: HMMER 3.4 (Aug 2023)
java: 22.0.1-internal
kallisto: 0.46.1
mafft: v7.526 (2024/Apr/26)
makeblastdb: makeblastdb 2.16.0+
minimap2: 2.28-r1209
pigz: 2.8
proteinortho: 6.3.2
pslCDnaFilter: no way to determine
salmon: salmon 1.10.3
samtools: samtools 1.21
signalp: 6.0
snap: 2006-07-28
stringtie: 2.2.3
tRNAscan-SE: 2.0.12 (Nov 2022)
tantan: tantan 50
tbl2asn: 25.8
tblastn: tblastn 2.16.0+
trimal: trimAl v1.5.rev0 build[2024-05-27]
trimmomatic: 0.39
All 37 external dependencies are installed
-------------------------------------------------------
Checking dependencies for 1.8.15
-------------------------------------------------------
You are running Python v 3.8.15. Now checking python packages...
biopython: 1.81
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.3.1
numpy: 1.24.3
pandas: 1.5.3
psutil: 5.9.5
requests: 2.29.0
scikit-learn: 1.2.2
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.12
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/data/databases/
$PASAHOME=/home/ubuntu/mambaforge/envs/funannotate/opt/pasa-2.5.2
$TRINITY_HOME=/home/ubuntu/mambaforge/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/ubuntu/mambaforge/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/ubuntu/mambaforge/envs/funannotate/config/
$GENEMARK_PATH=/opt/genemark/
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36x2
diamond: 2.1.6
emapper.py: 2.1.12
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-03-24
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.13.0+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.2.3
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.16.1
signalp: 4.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.11 (Oct 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.13.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 37 external dependencies are installed

Any suggestions?

Cheers.

nextgenusfs commented 4 days ago

Are these two separate Augustus installs? Can you look at the specific build numbers from conda? I'm guessing there is an issue with the build version in the 1.8.17 install. If BUSCO fails, it's almost always an Augustus issue.

JWDebler commented 3 days ago

Yes, they're installed on separate virtual machines; both were set up via conda with what was the latest funannotate version at the time.

The one in the 1.8.15 install: augustus 3.5.0 pl5321hf46c7bb_1 bioconda
The one in the 1.8.17 install: augustus 3.5.0 pl5321h95201ac_4 bioconda
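The two installs report the same Augustus version and differ only in the build string. As a rough illustration, a row in the name/version/build/channel layout that `conda list augustus` prints can be split apart as below. The helper name `parse_conda_list_line` is hypothetical, and the sketch assumes the usual conda convention that a build string ends in `_<build number>`.

```python
def parse_conda_list_line(line):
    """Split one `conda list` row into name, version, build string, and channel."""
    name, version, build, channel = line.split()
    # Assumption: by conda convention the build string ends in "_<build number>".
    build_number = int(build.rsplit("_", 1)[1])
    return {
        "name": name,
        "version": version,
        "build": build,
        "build_number": build_number,
        "channel": channel,
    }


# Rows copied from the two environments described in this comment.
old = parse_conda_list_line("augustus 3.5.0 pl5321hf46c7bb_1 bioconda")
new = parse_conda_list_line("augustus 3.5.0 pl5321h95201ac_4 bioconda")

print("same version:", old["version"] == new["version"])
print("build numbers:", old["build_number"], "vs", new["build_number"])
```
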

nextgenusfs commented 3 days ago

Okay, so the _4 build is the problem. Force-install the _1 build and it should work.

JWDebler commented 3 days ago

I just tried that and it complained

mamba install augustus=3.5.0=pl5321hf46c7bb_1


nextgenusfs commented 3 days ago

Would this work? mamba install "augustus==3.5.0,!=3.5.0=pl5321h95201ac_4"

nextgenusfs commented 3 days ago

In your particular case, you can just install funannotate via pip in your v1.8.15 environment which has a working augustus installation. ie

python -m pip install "funannotate==1.8.17"

JWDebler commented 2 days ago

Would this work? mamba install "augustus==3.5.0,!=3.5.0=pl5321h95201ac_4"

The following package could not be installed
└─ augustus ==3.5.0,!=3.5.0 pl5321h95201ac_4 does not exist (perhaps a typo or a missing channel).

That syntax is incorrect.

In your particular case, you can just install funannotate via pip in your v1.8.15 environment which has a working augustus installation. ie

I probably could, but that machine is about to get deleted, which is why I am setting everything up on a new one.

I have tried installing other Augustus builds, but keep running into the same libboost dependency problems. If I remember correctly, libboost was also a problem when I set up this environment, and it had to be installed separately after installing funannotate via conda.

I am currently running a few full genomes through the pipeline and everything works fine, it's just the BUSCO validation that returns fewer validated genes than the previous version did.


nextgenusfs commented 2 days ago

Frustrating!

You can certainly compile Augustus manually and link it to the conda environment. I actually use a dockerized version locally, as none of the builds will work on Apple silicon: https://github.com/nextgenusfs/dockerized-augustus. This is hacky but it works; I just put the dockerized scripts in the PATH.

mencian commented 1 day ago

Jumping in here; I've rebuilt augustus here, could you see if the new Augustus build fixes the issue?

JWDebler commented 1 day ago

Installed your latest Augustus build, but the test pipeline still fails due to validating too few BUSCO models.


nextgenusfs commented 21 hours ago

Okay, digging into this more. I set up a Docker install of funannotate v1.8.17 (installed via conda) in order to test. The filtering step extracts the protein sequences and then does an all-vs-all comparison to ensure that all gene calls are more than 80% divergent. What I'm seeing in that step is this, which is clearly wrong:

>gene364.t1 gene364
MCGIFAAFKHEDIHNFKPKALQLSKKIRHRGPDWSGNAVMNSTIFVHERLAIVGLDSGAQPITSADGEYMLGVNGEIYNH
IQLREMCSDYKFQTFSDCEPIIPLYLEHDIDAPKYLDGMFAFCLYDSKKDRIVAARDPIGVVTLYMGRSSQSPETVYFAS
ELKCLTDVCDSIISFPPGHVYDSETDKITRYFTPDWLDEKRIPSTPVDYHAIRHSLEKAVRKRLMAEVPYGVLLSGGLDS
SLIAAIAARETEKANADANEDNNVDEKQLAGIDDQGHLHTSGWSRLHSFAIGLPNAPDLQAARKVAKFIGSIHHEHTFTL
QEGLDALDDVIYHLETYDVTTIRASTPMFLLSRKIKAQGVKMVLSGEGSDEIFGGYLYFAQAPSAAEFHTESVQRVKNLH
LADCLRANKSTMAWGLEARVPFLDKDFLQLCMNIDPNEKMIKPKEGRIEKYILRKAFDTTDEPDVKPYLPEEILWRQKEQ
FSDGVGYSWIDGLRDTAERAISDAMFANPKADWGDDIPTTKEAYWYRLKFDAWFPQKTAADTVMRWIPKADWGCAEDPSG
RYAKIHEKHVSA**
>gene365.t1 gene365
N
>gene366.t1 gene366
D
>gene367.t1 gene367
S
>gene368.t1 gene368
MGEKRNRNGKDANSQNRKKFKVSSGFLDPGTSGIYATCSRRHERQAAQELQLLFEEKFQELYGDIKEGEDESENDEKKDL
SIEDQIKKELQELKGEETGKDLSSGETKKKDPLAFIDLNCECVTFCKTRKPIVPEEFVLSIMKDLADPKNMVKRTRYVQK
LTPITYSCNAKMEQLIKLANLVIGPHFHDPSNVKKNYKFAVEVTRRNFNTIERMDIINQVVKLVNKEGSEFNHTVDLKNY
DKLILVECFKSNIGMCVVDGDYKTKYRKYNVQQLYESKFRKDEDKSVKQ**
>gene369.t1 gene369
N
>gene370.t1 gene370
D

Still trying to figure out whether this is related to the Augustus build or, the other possibility, a Python 3.8 vs Python 3.9 issue, i.e. in how the code parses the Augustus results.
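The single-residue records in the FASTA excerpt above are easy to flag programmatically, which may help others check whether their install shows the same symptom. The following is a minimal, standalone sketch and not funannotate's own code: the function names are made up, and the `min_length=10` cutoff is an arbitrary assumption chosen only to catch obviously truncated proteins.

```python
def read_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the record ID.
            header, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def suspicious_records(text, min_length=10):
    """Return IDs whose sequence, ignoring trailing '*', is shorter than min_length.

    min_length is an arbitrary cutoff for this illustration.
    """
    return [
        record_id
        for record_id, seq in read_fasta(text)
        if len(seq.rstrip("*")) < min_length
    ]


# Shortened excerpt of the records shown in this comment.
example = """\
>gene364.t1 gene364
MCGIFAAFKHEDIHNFKPKALQLSKKIRHRGPDWSGNAVMNSTIFVHERLAIVGLDSGAQPITSADGEYMLGVNGEIYNH
>gene365.t1 gene365
N
>gene366.t1 gene366
D
"""

print(suspicious_records(example))
```

Running this against the protein FASTA written during the BUSCO validation step would show how many of the 370 models were reduced to fragments.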