nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Funannotate Predict Freezing #960

Open ChuChuChaddy opened 9 months ago

ChuChuChaddy commented 9 months ago

I finally got funannotate to run (using the Dockerfile) after your help last time, but I've hit another obstacle. The pipeline gets to BUSCO and freezes at 90%. I looked up similar issues in this repository and saw that the newer Augustus can lead to freezing. I stopped using the Dockerfile, installed mamba, and installed funannotate via conda, and it froze in the same spot. Right now, I've installed Augustus 3.3 myself and am running it on its own, hoping that I can add its output to the pipeline with --augustus_gff. I would still like to fix this problem in case I have to redo the run.

My species is a gastropod, and the closest Augustus species I could find was Argopecten irradians.

Thanks for your time and your help!

CM

Run command: /data/mansfieldc/software/mambaforge/envs/funannotate/bin/funannotate predict --cpus 32 -i ppb.f.clean.masked.fa --optimize_augustus -d /data/mansfieldc/software/funannotate-1.8.15/fundb --rna_bam ./training/trinity.alignments.bam --transcript_evidence ./training/trinity.fasta --protein_evidence ./Euk_Gas_moll_nodup.fa --genemark_gtf /data/mansfieldc/phys/genomes/ppb/fun/genemark/genemark.gtf --AUGUSTUS_CONFIG_PATH /data/mansfieldc/software/Augustus/config --out predict3 --species Argopecten_irradians --busco_db=/data/mansfieldc/software/busco/mollusca_odb10

cat funannotate-predict.log
09/11/23 14:42:46: OS: Rocky Linux 8.8, 36 cores, ~ 791 GB RAM. Python: 3.8.15
09/11/23 14:42:46: Running funannotate v1.8.15
09/11/23 14:42:46: GeneMark path: data/mansfieldc/software/gmes_linux_64_4/
09/11/23 14:42:46: Full path to gmes_petap.pl: data/mansfieldc/software/gmes_linux_64_4/gmes_petap.pl
09/11/23 14:42:46: GeneMark appears to be functional? False
09/11/23 14:42:46: exonerate version=exonerate 2.4.0 path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/exonerate
09/11/23 14:42:46: diamond version=2.1.8 path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/diamond
09/11/23 14:42:46: tbl2asn version=25.8 path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/tbl2asn
09/11/23 14:42:46: bedtools version=bedtools v2.31.0 path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/bedtools
09/11/23 14:42:46: augustus version=3.5.0 path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/augustus
09/11/23 14:42:46: etraining version=NA path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/etraining
09/11/23 14:42:46: tRNAscan-SE version=2.0.12 (Nov 2022) path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/tRNAscan-SE
09/11/23 14:42:46: bam2hints version=NA path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/bam2hints
09/11/23 14:42:46: minimap2 version=2.26-r1175 path=/data/mansfieldc/software/mambaforge/envs/funannotate/bin/minimap2

[09/11/23 14:42:47]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 2, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[09/11/23 14:42:48]: {'augustus': 'busco', 'snap': 'busco', 'glimmerhmm': 'busco', 'codingquarry': 'rna-bam'}
[09/11/23 14:42:48]: Parsed training data, run ab-initio gene predictors as follows:
[09/11/23 14:42:48]: augustus --species=anidulans --proteinprofile=/data/mansfieldc/software/mambaforge/envs/funannotate/lib/python3.8/site-packages/funannotate/config/EOG092C0B3U.prfl /data/mansfieldc/software/mambaforge/envs/funannotate/lib/python3.8/site-packages/funannotate/config/busco_test.fa
[09/11/23 14:42:49]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 2, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[09/11/23 14:46:05]: Loading genome assembly and parsing soft-masked repetitive sequences
[09/11/23 14:46:36]: Genome loaded: 2,772 scaffolds; 703,280,298 bp; 54.27% repeats masked
[09/11/23 14:46:38]: Existing transcript alignments found: predict3/predict_misc/transcript_alignments.gff3
[09/11/23 14:46:38]: Existing RNA-seq BAM hints found: predict3/predict_misc/hints.BAM.gff
[09/11/23 14:49:22]: Existing protein alignments found: predict3/predict_misc/protein_alignments.gff3

[09/11/23 14:50:28]: /data/mansfieldc/software/mambaforge/envs/funannotate/lib/python3.8/site-packages/funannotate/aux_scripts/genemark_gtf2gff3.pl /data/mansfieldc/Physella/genomes/ppb/fun/genemark/genemark.gtf
[09/11/23 14:50:46]: perl /data/mansfieldc/software/mambaforge/envs/funannotate/opt/evidencemodeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl predict3/predict_misc/genemark.gff
[09/11/23 14:51:48]: Running BUSCO to find conserved gene models for training ab-initio predictors
[09/11/23 14:51:48]: /data/mansfieldc/software/mambaforge/envs/funannotate/bin/python /data/mansfieldc/software/mambaforge/envs/funannotate/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/genome.softmasked.fa -m genome --lineage /data/mansfieldc/software/busco/mollusca_odb10 -o argopecten_irradians -c 32 --species anidulans -f --local_augustus /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/ab_initio_parameters/augustus

cat busco.log

INFO ** Start a BUSCO 2.0 analysis, current time: 09/11/2023 14:51:48 **
INFO The lineage dataset is: mollusca_odb10 (eukaryota)
INFO Mode is: genome
INFO Maximum number of regions limited to: 3
INFO To reproduce this run: python /data/mansfieldc/software/mambaforge/envs/funannotate/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/genome.softmasked.fa -o argopecten_irradians -l /data/mansfieldc/software/busco/mollusca_odb10/ -m genome -c 32 -sp anidulans
INFO Check dependencies...
INFO Check input file...
INFO Temp directory is ./tmp/

INFO ** Phase 1 of 2, initial predictions **
INFO ** Step 1/3, current time: 09/11/2023 14:51:57 **
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 09/11/2023 14:51:57
INFO [makeblastdb] New DB name: /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/busco/tmp/argopecten_irradians_325409627
INFO [makeblastdb] New DB title: /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/genome.softmasked.fa
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 3000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 2772 sequences in 7.31591 seconds.
INFO Running tblastn, writing output to /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/busco/run_argopecten_irradians/blast_output/tblastn_argopecten_irradians.tsv...
INFO ** Step 2/3, current time: 09/11/2023 15:11:57 **
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using anidulans as species:
INFO [augustus] Please find all logs related to Augustus here: /data/mansfieldc/Physella/genomes/ppb/fun/fun/predict3/predict_misc/busco/run_argopecten_irradians/augustus_output/augustus.log
INFO 09/11/2023 15:12:25 => 0% of predictions performed (6733 to be done)
INFO 09/11/2023 15:19:09 => 10% of predictions performed (674/6733 candidate regions)
INFO 09/11/2023 15:25:39 => 20% of predictions performed (1347/6733 candidate regions)
INFO 09/11/2023 15:32:05 => 30% of predictions performed (2020/6733 candidate regions)
INFO 09/11/2023 15:39:30 => 40% of predictions performed (2694/6733 candidate regions)
INFO 09/11/2023 15:46:08 => 50% of predictions performed (3368/6733 candidate regions)
INFO 09/11/2023 15:53:41 => 60% of predictions performed (4040/6733 candidate regions)
INFO 09/11/2023 16:00:15 => 70% of predictions performed (4714/6733 candidate regions)
INFO 09/11/2023 16:07:20 => 80% of predictions performed (5387/6733 candidate regions)
INFO 09/11/2023 16:14:33 => 90% of predictions performed (6060/6733 candidate regions)
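
For anyone trying to tell a true freeze from a slow final stretch, the progress lines in busco.log can be timed. The following is only a rough Python sketch, not part of funannotate or BUSCO: the function names, the month/day/year reading of the timestamp, and the 3x threshold are all assumptions made for illustration. It reads the "=> N% of predictions performed" lines and flags a stall when the wait since the last report is much longer than any earlier 10% step.

```python
import re
from datetime import datetime

# Matches progress lines of the form seen in busco.log, e.g.
# "09/11/2023 16:14:33 => 90% of predictions performed (...)"
PROGRESS = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) => (\d+)% of predictions performed"
)


def parse_progress(log_text):
    """Return (timestamp, percent) pairs in the order they appear."""
    # Assumption: the date is month/day/year; only time differences are used.
    return [
        (datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), int(pct))
        for stamp, pct in PROGRESS.findall(log_text)
    ]


def looks_stalled(points, now, factor=3.0):
    """True if the wait since the last report exceeds factor x the slowest step."""
    if len(points) < 2:
        return False  # not enough history to judge
    steps = [(b[0] - a[0]).total_seconds() for a, b in zip(points, points[1:])]
    waited = (now - points[-1][0]).total_seconds()
    return waited > factor * max(steps)


# Last three progress lines copied from the log above.
sample = """
INFO 09/11/2023 16:00:15 => 70% of predictions performed (4714/6733 candidate regions)
INFO 09/11/2023 16:07:20 => 80% of predictions performed (5387/6733 candidate regions)
INFO 09/11/2023 16:14:33 => 90% of predictions performed (6060/6733 candidate regions)
"""
points = parse_progress(sample)
print(points[-1][1], looks_stalled(points, datetime(2023, 9, 11, 18, 0, 0)))
```

In the log above each 10% step took roughly seven minutes, so a wait of well over twenty minutes after the 90% line would be flagged by this heuristic. In practice `now` would be `datetime.now()` and `log_text` the contents of busco.log.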

The augustus.log in /predict_misc/busco/run_argopecten_irradians is full of warnings such as:
Warning: Block no.unknown_A is not significant enough, removed from profile.
Warning: Block no.unknown_B is not significant enough, removed from profile.
Warning: Block no.unknown_H is not significant enough, removed from profile.
Warning: Block no.unknown_G is not significant enough, removed from profile.
Warning: Block no.unknown_H is not significant enough, removed from profile.
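
Because these block messages are so numerous, they can bury anything else in the file. The following is a small Python sketch, not a funannotate or Augustus utility: `summarize` is a hypothetical helper, and it assumes the log file holds one message per line. It counts the "not significant enough" warnings and collects any other line that mentions an error or warning, so that those are easier to spot.

```python
import re
from collections import Counter

# The block-profile warning exactly as it appears in augustus.log above.
BLOCK_WARNING = re.compile(
    r"Warning: Block no\.\S+ is not significant enough, removed from profile\."
)


def summarize(log_text):
    """Count block-profile warnings and collect other error/warning lines."""
    counts = Counter()
    other = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if BLOCK_WARNING.fullmatch(line):
            counts["block_not_significant"] += 1
        elif re.search(r"error|warning", line, re.IGNORECASE):
            # Anything else worth reading by hand.
            other.append(line)
    return counts, other


# The five warnings quoted above, used as sample input.
sample = """
Warning: Block no.unknown_A is not significant enough, removed from profile.
Warning: Block no.unknown_B is not significant enough, removed from profile.
Warning: Block no.unknown_H is not significant enough, removed from profile.
Warning: Block no.unknown_G is not significant enough, removed from profile.
Warning: Block no.unknown_H is not significant enough, removed from profile.
"""
counts, other = summarize(sample)
print(counts["block_not_significant"], other)
```

If `other` comes back empty for the full log, the file contains only the block warnings, and the cause of the freeze would have to be looked for elsewhere.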

I