nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

funannotate predict fail at predict_misc stage #404

Closed Jennie17 closed 3 years ago

Jennie17 commented 4 years ago

Are you using the latest release? yes, installed via conda.

Describe the bug: The run dies silently, at random points, toward the end of the following stages:

Program      Training-Method
augustus     busco
glimmerhmm   busco
snap         busco

What command did you issue?

funannotate predict \
  -i groups.asm.cleaned.sorted.masked.fasta \
  -o FunannoOutput \
  -s prefixFunAnno \
  --isolate w64P \
  --name prefixFunAnno_ \
  --protein_evidence proteins.fasta \
  --transcript_evidence refG_176ID_novo.fasta \
  --cpus 8

Logfiles: The first run died silently at the last command in its log, shown below:

[04/04/20 19:57:03]: /home/jean2573/.conda/envs/funannotate174/bin/python /home/jean2573/.conda/envs/funannotate174/lib/python2.7/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/docs/2020/doc_hc_hybrid/monoploid/default_configuration/analysis_files/funanno/predict_misc/genome.softmasked.fa -m genome --lineage /scratch/w6411/jeanniehome/tools/funannotate_database/dikarya -o p64phased0_w6411p -c 8 --species anidulans -f --local_augustus /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/docs/2020/doc_hc_hybrid/monoploid/default_configuration/analysis_files/funanno/predict_misc/ab_initio_parameters/augustus

The second run died a little later, at the last command in its log, shown below:

[04/06/20 21:08:37]: 733 total gene models from EVM, now validating with BUSCO HMM search
[04/06/20 21:08:37]: /home/jean2573/.conda/envs/funannotate174/opt/evidencemodeler-1.1.1/EvmUtils/gff3_file_to_proteins.pl /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/docs/2020/doc_hc_hybrid/monoploid/default_configuration/analysis_files/funanno/predict_misc/busco.evm.gff3 /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/docs/2020/doc_hc_hybrid/monoploid/default_configuration/analysis_files/funanno/predict_misc/genome.softmasked.fa

OS/Install Information

You are running Perl v 5.026002. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.852
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/scratch/WR6411/jeanniehome/tools/funannotate_database
$PASAHOME=/home/jean2573/.conda/envs/funannotate174/opt/pasa-2.4.1
$TRINITY_HOME=/home/jean2573/.conda/envs/funannotate174/opt/trinity-2.8.5
$EVM_HOME=/home/jean2573/.conda/envs/funannotate174/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/jean2573/.conda/envs/funannotate174/config/
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir


Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.2
bamtools: bamtools 2.5.1
bedtools: bedtools v2.29.2
blat: BLAT v36
diamond: 0.9.24
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.0
hmmscan: HMMER 3.3 (Nov 2019)
hmmsearch: HMMER 3.3 (Nov 2019)
java: 11.0.1-internal
kallisto: 0.46.2
mafft: v7.455 (2019/Dec/7)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.14
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.9
snap: 2006-07-28
stringtie: 2.1.1
tRNAscan-SE: 2.0.5 (October 2019)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: emapper.py not installed
ERROR: ete3 not installed
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed

sklcusa commented 4 years ago

How about adding the "--augustus_species" parameter and running again?

nextgenusfs commented 4 years ago

Hi @Jennie17, it's most likely an issue with Augustus. What does the initial BUSCO log file look like? Any clues in there as to why it died?

Jennie17 commented 4 years ago

Hi @nextgenusfs, thanks for your reply.

I have pasted the BUSCO log files from my two runs below.

Could you please suggest how to handle this error?

  1. The BUSCO log file for the first run, which died earlier, is below.

INFO ** Start a BUSCO 2.0 analysis, current time: 04/04/2020 19:57:03 **
INFO The lineage dataset is: dikarya_odb9 (eukaryota)
INFO Mode is: genome
INFO Maximum number of regions limited to: 3
INFO To reproduce this run: python /home/jean2573/.conda/envs/funannotate174/lib/python2.7/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /scratch/RDS-FSC-20p-RW/jeanniewu/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/genome.softmasked.fa -o p64phased0_WR6411p -l /scratch/WR6411/jeanniehome/tools/funannotate_database/dikarya/ -m genome -c 8 -sp anidulans
INFO Check dependencies...
INFO Check input file...
INFO Temp directory is ./tmp/

INFO ** Phase 1 of 2, initial predictions **
INFO ** Step 1/3, current time: 04/04/2020 19:57:07 **
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 04/04/2020 19:57:11
INFO [makeblastdb] New DB name: ./tmp/p64phased0_WR6411p_2761509396
INFO [makeblastdb] New DB title: /scratch/RDS-FSC-20p-RW/jeanniewu/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/genome.softmasked.fa
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep Linkouts: T
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 1000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 43 sequences in 3.49766 seconds.
INFO Running tblastn, writing output to /scratch/RDS-FSC-20p-RW/jeanniewu/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/run_p64phased0_WR6411p/blast_output/tblastn_p64phased0_WR6411p.tsv...
INFO ** Step 2/3, current time: 04/04/2020 19:59:44 **
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using anidulans as species:
INFO [augustus] Please find all logs related to Augustus here: /scratch/RDS-FSC-20p-RW/jeanniewu/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/run_p64phased0_WR6411p/augustus_output/augustus.log
INFO 04/04/2020 19:59:46 => 0% of predictions performed (1373 to be done)
INFO 04/04/2020 20:01:24 => 10% of predictions performed (152/1373 candidate regions)
INFO 04/04/2020 20:02:58 => 20% of predictions performed (289/1373 candidate regions)
INFO 04/04/2020 20:04:43 => 30% of predictions performed (426/1373 candidate regions)
INFO 04/04/2020 20:06:26 => 40% of predictions performed (563/1373 candidate regions)
INFO 04/04/2020 20:08:09 => 50% of predictions performed (701/1373 candidate regions)
INFO 04/04/2020 20:09:24 => 60% of predictions performed (838/1373 candidate regions)
INFO 04/04/2020 20:11:03 => 70% of predictions performed (975/1373 candidate regions)
INFO 04/04/2020 20:12:23 => 80% of predictions performed (1113/1373 candidate regions)
INFO 04/04/2020 20:13:51 => 90% of predictions performed (1250/1373 candidate regions)
INFO 04/04/2020 20:15:30 => 100% of predictions performed
INFO Extracting predicted proteins...
INFO ** Step 3/3, current time: 04/04/2020 20:15:53 **
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 04/04/2020 20:15:53 => 0% of predictions performed (1349 to be done)
INFO 04/04/2020 20:15:55 => 10% of predictions performed (151/1349 candidate proteins)
INFO 04/04/2020 20:15:57 => 20% of predictions performed (288/1349 candidate proteins)
INFO 04/04/2020 20:15:59 => 30% of predictions performed (419/1349 candidate proteins)
INFO 04/04/2020 20:16:00 => 40% of predictions performed (554/1349 candidate proteins)
INFO 04/04/2020 20:16:02 => 50% of predictions performed (690/1349 candidate proteins)
INFO 04/04/2020 20:16:04 => 60% of predictions performed (823/1349 candidate proteins)
INFO 04/04/2020 20:16:06 => 70% of predictions performed (958/1349 candidate proteins)
INFO 04/04/2020 20:16:08 => 80% of predictions performed (1096/1349 candidate proteins)
INFO 04/04/2020 20:16:11 => 90% of predictions performed (1228/1349 candidate proteins)

  2. The BUSCO log file for the second run, which seems to have run longer than the first:

INFO ** Start a BUSCO 2.0 analysis, current time: 04/06/2020 20:46:16 **
INFO The lineage dataset is: dikarya_odb9 (eukaryota)
INFO Mode is: genome
INFO Maximum number of regions limited to: 3
INFO To reproduce this run: python /home/jean2573/.conda/envs/funannotate174/lib/python2.7/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/genome.softmasked.fa -o p64phased0_WR6411p -l /scratch/WR6411/jeanniehome/tools/funannotate_database/dikarya/ -m genome -c 8 -sp anidulans
INFO Check dependencies...
INFO Check input file...
INFO Temp directory is ./tmp/

INFO ** Phase 1 of 2, initial predictions **
INFO ** Step 1/3, current time: 04/06/2020 20:46:18 **
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 04/06/2020 20:46:19
INFO [makeblastdb] New DB name: /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/tmp/p64phased0_WR6411p_153445145
INFO [makeblastdb] New DB title: /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/genome.softmasked.fa
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep Linkouts: T
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 1000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 43 sequences in 1.95045 seconds.
INFO Running tblastn, writing output to /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/run_p64phased0_WR6411p/blast_output/tblastn_p64phased0_WR6411p.tsv...
INFO ** Step 2/3, current time: 04/06/2020 20:48:50 **
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using anidulans as species:
INFO [augustus] Please find all logs related to Augustus here: /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/run_p64phased0_WR6411p/augustus_output/augustus.log
INFO 04/06/2020 20:48:52 => 0% of predictions performed (1386 to be done)
INFO 04/06/2020 20:50:17 => 10% of predictions performed (153/1386 candidate regions)
INFO 04/06/2020 20:51:37 => 20% of predictions performed (292/1386 candidate regions)
INFO 04/06/2020 20:53:08 => 30% of predictions performed (430/1386 candidate regions)
INFO 04/06/2020 20:54:37 => 40% of predictions performed (569/1386 candidate regions)
INFO 04/06/2020 20:56:04 => 50% of predictions performed (707/1386 candidate regions)
INFO 04/06/2020 20:57:08 => 60% of predictions performed (846/1386 candidate regions)
INFO 04/06/2020 20:58:35 => 70% of predictions performed (986/1386 candidate regions)
INFO 04/06/2020 20:59:42 => 80% of predictions performed (1123/1386 candidate regions)
INFO 04/06/2020 21:00:58 => 90% of predictions performed (1262/1386 candidate regions)
INFO 04/06/2020 21:02:21 => 100% of predictions performed
INFO Extracting predicted proteins...
INFO ** Step 3/3, current time: 04/06/2020 21:02:39 **
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 04/06/2020 21:02:39 => 0% of predictions performed (1361 to be done)
INFO 04/06/2020 21:02:42 => 10% of predictions performed (150/1361 candidate proteins)
INFO 04/06/2020 21:02:43 => 20% of predictions performed (289/1361 candidate proteins)
INFO 04/06/2020 21:02:44 => 30% of predictions performed (422/1361 candidate proteins)
INFO 04/06/2020 21:02:45 => 40% of predictions performed (560/1361 candidate proteins)
INFO 04/06/2020 21:02:46 => 50% of predictions performed (696/1361 candidate proteins)
INFO 04/06/2020 21:02:47 => 60% of predictions performed (831/1361 candidate proteins)
INFO 04/06/2020 21:02:49 => 70% of predictions performed (967/1361 candidate proteins)
INFO 04/06/2020 21:02:50 => 80% of predictions performed (1104/1361 candidate proteins)
INFO 04/06/2020 21:02:52 => 90% of predictions performed (1239/1361 candidate proteins)
INFO 04/06/2020 21:02:54 => 100% of predictions performed
INFO Results:
INFO C:58.8%[S:55.4%,D:3.4%],F:15.9%,M:25.3%,n:1312
INFO 771 Complete BUSCOs (C)
INFO 727 Complete and single-copy BUSCOs (S)
INFO 44 Complete and duplicated BUSCOs (D)
INFO 209 Fragmented BUSCOs (F)
INFO 332 Missing BUSCOs (M)
INFO 1312 Total BUSCO groups searched

INFO ** Phase 2 of 2, predictions using species specific training ** INFO ** Step 1/3, current time: 04/06/2020 21:02:54 ** INFO Extracting missing and fragmented buscos from the ancestral_variants file... WARNING The busco id(s) ['EOG09262TO9', 'EOG09263KVG', 'EOG09265KPR', 'EOG09265SHM', 'EOG09262ZZ8', 'EOG09264DY4', 'EOG09260289', 'EOG0926300R', 'EOG092656JA', 'EOG092620V3', 'EOG092619L1', 'EOG092629WJ', 'EOG092640PR', 'EOG092629WA', 'EOG09260EPS', 'EOG09262CMP', 'EOG09261Q18', 'EOG09262D4G', 'EOG0926477X', 'EOG09265I60', 'EOG09262QRH', 'EOG09263OAE', 'EOG09263JFQ', 'EOG092653LT', 'EOG092658WS', 'EOG0926577T', 'EOG09264S4C', 'EOG09264GMT', 'EOG09264BOA', 'EOG09265GYD', 'EOG09262XMN', 'EOG09260931', 'EOG09264PK5', 'EOG09260N2T', 'EOG092602OP', 'EOG09261727', 'EOG09262VVF', 'EOG09261O7R', 'EOG09264CND', 'EOG09264YIJ', 'EOG09260FPA', 'EOG09262YQG', 'EOG09263C55', 'EOG09260274', 'EOG09264SET', 'EOG092644N1', 'EOG09264V51', 'EOG09263IQ5', 'EOG09262G8Y', 'EOG092628FW', 'EOG09262W7C', 'EOG09265B4G', 'EOG0926448Q', 'EOG09263OZR', 'EOG09261EAB', 'EOG09261KZ6', 'EOG09264B74', 'EOG09264XKX', 'EOG09261127', 'EOG0926510L', 'EOG09261LEU', 'EOG0926587S', 'EOG09260Z5E', 'EOG092643Y5', 'EOG09263RW3', 'EOG0926009O', 'EOG09261MMR', 'EOG09261I0F', 'EOG09261CQG', 'EOG0926079Q', 'EOG09265KF9', 'EOG09264881', 'EOG09261Q5L', 'EOG092606AJ', 'EOG092605KN', 'EOG09264VC6', 'EOG092606AD', 'EOG09265BE5', 'EOG09261ZJR', 'EOG09263UAJ', 'EOG09265K4K', 'EOG092603EH', 'EOG09261V2P', 'EOG09261JUE', 'EOG09260PI1', 'EOG092636T6', 'EOG09261S0S', 'EOG09263YBT', 'EOG092612MY', 'EOG09262PIH', 'EOG09261TEQ', 'EOG092634MM', 'EOG092654XA', 'EOG09260WGT', 'EOG09261NW2', 'EOG092639H5', 'EOG09263C4C', 'EOG092626HU', 'EOG09261DG0', 'EOG09264HU0', 'EOG09262NIR', 'EOG09264BWL', 'EOG09265JNA', 'EOG09263IMF', 'EOG09264O9F', 'EOG09261IFQ', 'EOG09260EAZ', 'EOG09263A3Y', 'EOG09264UJF', 'EOG09260CMZ', 'EOG09262I0R', 'EOG09261OIF', 'EOG09264R0D', 'EOG092618M2', 'EOG09260QNB', 'EOG092617RY', 'EOG09264903', 
'EOG09264904', 'EOG0926499W', 'EOG09261EMF', 'EOG092613UB', 'EOG09264OBO', 'EOG09262IZ6', 'EOG09263F11', 'EOG092606CY', 'EOG09264ZI5', 'EOG09262O0R', 'EOG09260B65', 'EOG09263YUI', 'EOG09264RBX', 'EOG09260LVD', 'EOG09265K1R', 'EOG092600SD', 'EOG09260OZU', 'EOG092628LW', 'EOG092658X5', 'EOG0926051U', 'EOG09261ZPW', 'EOG09260A6Q', 'EOG09261W2O', 'EOG092638CT', 'EOG092604KQ', 'EOG09264HHW', 'EOG09262UB3', 'EOG09265BG5', 'EOG09261V87', 'EOG09260KDB', 'EOG09264WF4', 'EOG09260VTN', 'EOG09261W90', 'EOG092648XW', 'EOG09263R62', 'EOG09260FMW', 'EOG09264XUV', 'EOG09262PMC', 'EOG092610ZY', 'EOG09264PMA', 'EOG092606WZ', 'EOG09262X8R', 'EOG09261ICI', 'EOG092631MU', 'EOG09265I72', 'EOG092631ML', 'EOG09260EE7', 'EOG092605QM', 'EOG09264W1U', 'EOG09263CUQ', 'EOG09263QPR', 'EOG09263720', 'EOG0926310O', 'EOG09264VZ7', 'EOG09262JRP', 'EOG09262TEV', 'EOG09264331', 'EOG09261V9P', 'EOG09263S1E', 'EOG09262GLP', 'EOG092634G9', 'EOG09265BJ3', 'EOG09260NHB', 'EOG092643JW', 'EOG09265I7S', 'EOG092605VU', 'EOG09262OLP', 'EOG09265040', 'EOG09262H34', 'EOG09262U7S', 'EOG09265K5D', 'EOG09264XVU', 'EOG09260H81', 'EOG09260SJV', 'EOG09260NAN', 'EOG0926431P', 'EOG09261I8J', 'EOG09263M8W', 'EOG09260RNZ', 'EOG09262Z2S', 'EOG092648O6', 'EOG092644DY', 'EOG09260LRX', 'EOG09262LI4', 'EOG092646PE', 'EOG09261QR8', 'EOG09262FJB', 'EOG09260A27', 'EOG09262SWJ', 'EOG09265LIB', 'EOG0926025H', 'EOG09264ZDJ', 'EOG0926229Z', 'EOG0926142Y', 'EOG09260XQV', 'EOG09263CG2', 'EOG092626EQ', 'EOG09263NXM', 'EOG09260075', 'EOG09260RRC', 'EOG09261UMG', 'EOG092643NE', 'EOG09264L6D', 'EOG09262ESR', 'EOG09265E8A', 'EOG09264IPL', 'EOG0926457R', 'EOG09265DRM', 'EOG09260WUS', 'EOG0926077L', 'EOG0926047G', 'EOG09265313', 'EOG09262E7W', 'EOG09265PJ3', 'EOG09261T6U', 'EOG09262QL6', 'EOG092617S2', 'EOG092618UZ', 'EOG09260AQB', 'EOG092613R2', 'EOG09260BRA', 'EOG0926049S', 'EOG092654VM', 'EOG09261DJC', 'EOG09261I1G', 'EOG09261ACJ', 'EOG09260QVP', 'EOG09264FVQ', 'EOG0926195C', 'EOG09260OE9', 'EOG09264LJU', 'EOG09261EU7', 'EOG092653NM', 
'EOG092604ZZ', 'EOG09264SQJ', 'EOG09263H5H', 'EOG09264XM2', 'EOG092620FM', 'EOG09263Z41', 'EOG09264E6Z', 'EOG09261N20', 'EOG09263LP3', 'EOG092649XV', 'EOG092645L9', 'EOG09264RJL', 'EOG0926129I', 'EOG09263L7Y', 'EOG09264J8E', 'EOG092648K5', 'EOG09260KCB', 'EOG09260AZA', 'EOG09261YGB', 'EOG09260AZK', 'EOG09261RU1', 'EOG09260DP1', 'EOG09263I7I', 'EOG092634ZL', 'EOG09261MPU', 'EOG092609RF', 'EOG09260OQ8', 'EOG092634B5', 'EOG09263JZO', 'EOG09265PQX', 'EOG09260OLB', 'EOG09264CH7', 'EOG09263XVS', 'EOG0926137U', 'EOG09260KUC', 'EOG09262HP3', 'EOG0926315C', 'EOG092625U6', 'EOG09264V2U', 'EOG09261IEH', 'EOG09263ZW6', 'EOG09260W52', 'EOG09260TPT', 'EOG092610LP', 'EOG09262739', 'EOG0926158Y', 'EOG09260LI6', 'EOG09264I6B', 'EOG09262SR7', 'EOG092650I8', 'EOG09262X74', 'EOG09261HZD', 'EOG09263K05', 'EOG09261ABB', 'EOG09263AZP', 'EOG092614DJ', 'EOG09264NNY', 'EOG092619GP', 'EOG09264IV9', 'EOG09261HQU', 'EOG092606UX', 'EOG09261YRA', 'EOG09262E3Q', 'EOG092643IE', 'EOG0926374A', 'EOG092644WX', 'EOG092604A0', 'EOG09265K60', 'EOG092647CM', 'EOG09261WJ8', 'EOG09263GSR', 'EOG09261XZ6', 'EOG09263J7D', 'EOG09264TJN', 'EOG09262X01', 'EOG092641M3', 'EOG09260ERO', 'EOG09264AWW', 'EOG092619NK', 'EOG09260NNR', 'EOG09264VRO', 'EOG09262X7T', 'EOG09260EZT', 'EOG092654O3', 'EOG09260THV', 'EOG09264G0H', 'EOG09260KGS', 'EOG09263BG5', 'EOG092617AN', 'EOG09264T1U', 'EOG09265GGX', 'EOG09264KTU', 'EOG092619RJ', 'EOG092632TF', 'EOG09260W9L', 'EOG09260B3X', 'EOG09262WQX', 'EOG092653VU', 'EOG09264W7W', 'EOG092640ET', 'EOG09262516', 'EOG09265RGI', 'EOG09260NY2', 'EOG092651FJ', 'EOG09262LQT', 'EOG0926384F', 'EOG0926251E', 'EOG09265M8S', 'EOG092658CI', 'EOG09264R1U', 'EOG09264W71', 'EOG0926133I', 'EOG09261DRB', 'EOG092640WA', 'EOG09262BCZ', 'EOG09264O2D', 'EOG09262MFL', 'EOG09260DUW', 'EOG09262HKC', 'EOG09260SAH', 'EOG09262341', 'EOG09263CAH', 'EOG092658QH', 'EOG09260EQD', 'EOG09260P2K', 'EOG09260JJW', 'EOG092621X3', 'EOG09261XGL', 'EOG09265AT5', 'EOG09264719', 'EOG09263HYJ', 'EOG09262KXK', 'EOG09263G4R', 
'EOG09261IOS', 'EOG09262KZ3', 'EOG092648U2', 'EOG09262645', 'EOG09261JWS', 'EOG09264QRY', 'EOG09264KSI', 'EOG092638AP', 'EOG092658ZO', 'EOG0926578E', 'EOG09264B2P', 'EOG09262QS5', 'EOG09260WG2', 'EOG09264NJ1', 'EOG092651OF', 'EOG09262QTY', 'EOG09264Y0W', 'EOG092620U5', 'EOG09264LBC', 'EOG09260E2O', 'EOG09260P5R', 'EOG09261DGU', 'EOG09260OCI', 'EOG09262KJA', 'EOG09261FAB', 'EOG09265A4E', 'EOG09265QTV', 'EOG09261666', 'EOG092619VG', 'EOG092651HW', 'EOG09263FAK', 'EOG09265RGS', 'EOG092636Y6', 'EOG092646C6', 'EOG09264XJC', 'EOG09264FWI', 'EOG092646CB', 'EOG09264V30', 'EOG09264LH2', 'EOG0926306O', 'EOG09261L4M', 'EOG09264RR2', 'EOG09261FH7', 'EOG09264FXB', 'EOG09265LEG', 'EOG09265JA7', 'EOG092615CC', 'EOG09260M44', 'EOG09262V9N', 'EOG09265GSM', 'EOG09265HPJ', 'EOG09263Z8I', 'EOG09262D0D', 'EOG09264X31', 'EOG09264HX6', 'EOG0926390Q', 'EOG092630UB', 'EOG092608L0', 'EOG092600X0', 'EOG09264829', 'EOG09261NLY', 'EOG092655SO', 'EOG092646VF', 'EOG09263P17', 'EOG092652TN', 'EOG09260J97', 'EOG09260JDM', 'EOG092604I8', 'EOG09260NHN', 'EOG092624X0', 'EOG092618J9', 'EOG09265KNL', 'EOG092608AE', 'EOG09265GXF', 'EOG092612J3', 'EOG09265PWR', 'EOG09261BPJ', 'EOG09262DPL', 'EOG092638EN', 'EOG092645U1', 'EOG09262GWQ', 'EOG09264DYD', 'EOG09263L9T', 'EOG09260K4V', 'EOG09264COX', 'EOG09261KRX', 'EOG09260K4B', 'EOG09260M87', 'EOG09260AEC', 'EOG09262POL', 'EOG09264NDD', 'EOG09265CQO', 'EOG09260SR2', 'EOG09264L06', 'EOG09263A5D', 'EOG092624JL', 'EOG09260NXJ', 'EOG09263VOP', 'EOG09262PZ9', 'EOG09265B1X', 'EOG09261IEV', 'EOG092608WU', 'EOG09260IU8', 'EOG09264UKL', 'EOG09262GXD', 'EOG09262N10', 'EOG092624SJ', 'EOG09262XGK', 'EOG09264K2W', 'EOG0926425H', 'EOG092655L0', 'EOG09262PAY', 'EOG09264X41', 'EOG092643QW', 'EOG09264TQ5', 'EOG09264XOZ', 'EOG092643QM', 'EOG092654QZ', 'EOG09263WLB', 'EOG092651K1', 'EOG09265M98', 'EOG0926369X', 'EOG092611HB', 'EOG09261SS1', 'EOG092610QT', 'EOG09263DFA', 'EOG09260NZ8', 'EOG09264KK7', 'EOG09262KN8', 'EOG09264XTW', 'EOG09263J6Z', 'EOG09264PDD', 'EOG092649VA', 
'EOG09260U6R', 'EOG092607QZ', 'EOG09264C3V', 'EOG09263OWL', 'EOG09262TUR', 'EOG09261404', 'EOG09262SI7', 'EOG09265C25', 'EOG09264XF3', 'EOG092621GA', 'EOG09265B95', 'EOG092604S1', 'EOG092605ZA', 'EOG09260375', 'EOG092635SS', 'EOG092635ST', 'EOG09261MG9', 'EOG09264IKZ', 'EOG09260KNR', 'EOG09264873', 'EOG09263X4B', 'EOG09260C2V', 'EOG09262OX9', 'EOG09261LV7', 'EOG09261476'] were not found in the ancestral_variants file INFO Running tblastn, writing output to /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/run_p64phased0_WR6411p/blast_output/tblastn_p64phased0_WR6411p_missing_and_frag_rerun.tsv... INFO [tblastn] Warning: [tblastn] Query is Empty! INFO Getting coordinates for candidate regions... INFO ** Step 2/3, current time: 04/06/2020 21:02:55 ** INFO Training Augustus using Single-Copy Complete BUSCOs: INFO 04/06/2020 21:02:55 => Converting predicted genes to short genbank files... INFO 04/06/2020 21:03:51 => All files converted to short genbank files, now running the training scripts... INFO Pre-Augustus scaffold extraction... INFO Re-running Augustus with the new metaparameters, number of target BUSCOs: 541 INFO 04/06/2020 21:03:54 => 0% of predictions performed (0 to be done) INFO 04/06/2020 21:03:54 => 100% of predictions performed INFO Extracting predicted proteins... INFO ** Step 3/3, current time: 04/06/2020 21:03:54 ** INFO Running HMMER to confirm orthology of predicted proteins: INFO 04/06/2020 21:03:54 => 0% of predictions performed (0 to be done) INFO 04/06/2020 21:03:54 => 100% of predictions performed INFO Results: INFO C:58.8%[S:55.4%,D:3.4%],F:15.9%,M:25.3%,n:1312 INFO 771 Complete BUSCOs (C) INFO 727 Complete and single-copy BUSCOs (S) INFO 44 Complete and duplicated BUSCOs (D) INFO 209 Fragmented BUSCOs (F) INFO 332 Missing BUSCOs (M) INFO 1312 Total BUSCO groups searched

INFO BUSCO analysis done with WARNING(s). Total running time: 1059.354913 seconds
INFO Results written in /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco/run_p64phased0_WR6411p/

INFO ** Start a BUSCO 2.0 analysis, current time: 04/06/2020 21:08:48 **
INFO The lineage dataset is: dikarya_odb9 (eukaryota)
INFO Mode is: proteins
INFO To reproduce this run: python /home/jean2573/.conda/envs/funannotate174/lib/python2.7/site-packages/funannotate/aux_scripts/funannotate-BUSCO2.py -i /scratch/RDS-FSC-20p-RW/jeannie/runningfiles/documents/2020/document_hc_hybrid/allhc_monoploid/default_configuration/analysis_files/funanno/predict_misc/busco.evm.proteins.fa -o p64phased0_WR6411p -l /scratch/WR6411/jeanniehome/tools/funannotate_database/dikarya/ -m proteins -c 8 -sp anidulans
INFO Check dependencies...
INFO Check input file...
INFO Temp directory is ./tmp/
INFO Running HMMER on the proteins:
INFO 04/06/2020 21:08:48 => 0% of predictions performed (1312 to be done)
INFO 04/06/2020 21:08:51 => 10% of predictions performed (145/1312 candidate proteins)
INFO 04/06/2020 21:08:52 => 20% of predictions performed (276/1312 candidate proteins)
INFO 04/06/2020 21:08:53 => 30% of predictions performed (407/1312 candidate proteins)
INFO 04/06/2020 21:08:54 => 40% of predictions performed (538/1312 candidate proteins)
INFO 04/06/2020 21:08:55 => 50% of predictions performed (671/1312 candidate proteins)
INFO 04/06/2020 21:08:57 => 60% of predictions performed (802/1312 candidate proteins)
INFO 04/06/2020 21:08:58 => 70% of predictions performed (933/1312 candidate proteins)
INFO 04/06/2020 21:08:59 => 80% of predictions performed (1064/1312 candidate proteins)
INFO 04/06/2020 21:09:01 => 90% of predictions performed (1195/1312 candidate proteins)

nextgenusfs commented 4 years ago

What version of tblastn are you running? I noticed this in the log file:

INFO    [tblastn]   Warning: [tblastn] Query is Empty!

Multithreading is unstable or broken in some versions of tblastn, so perhaps it is dying sporadically, which would explain why the run seems to die at different points.
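To compare where each run stopped, one option is to pull the last progress line out of each BUSCO log. The snippet below is only a rough sketch: the function names are made up for illustration and are not part of funannotate or BUSCO, and the pattern assumes the progress-line format seen in the logs pasted above. It also only looks at the final progress line, so it would not catch a death that happens after a step reaches 100%.

```python
import re

# Progress lines as they appear in the BUSCO 2.0 logs in this thread, e.g.
# "INFO 04/04/2020 20:16:11 => 90% of predictions performed (1228/1349 candidate proteins)"
PROGRESS = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) => (\d+)% of predictions performed"
)


def last_progress(log_text):
    """Return (timestamp, percent) for the final progress line, or None if there is none."""
    matches = PROGRESS.findall(log_text)
    if not matches:
        return None
    stamp, percent = matches[-1]
    return stamp, int(percent)


def looks_truncated(log_text):
    """Heuristic (an assumption): a log whose last progress line is below 100% stopped mid-step."""
    last = last_progress(log_text)
    return last is not None and last[1] < 100


# Tail of the first run's log, copied from the comment above.
sample = (
    "INFO 04/04/2020 20:16:08 => 80% of predictions performed (1096/1349 candidate proteins) "
    "INFO 04/04/2020 20:16:11 => 90% of predictions performed (1228/1349 candidate proteins)"
)
print(last_progress(sample), looks_truncated(sample))
```

Running this on both logs would show that each one stops at the 90% mark of a HMMER step, in genome mode for the first run and in proteins mode for the second.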

Jennie17 commented 4 years ago

Hi @nextgenusfs,

I used this command to check:

$ funannotate check --show-versions

I got the output: tblastn: tblastn 2.2.31+

If this version is not compatible, which command shall I use to install the working version of tblastn?

BTW, I also got error messages for some external dependencies.

Are they necessary? Which part of the pipeline will these 4 dependencies affect?

ERROR: emapper.py not installed
ERROR: ete3 not installed
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed

Thanks!

sklcusa commented 4 years ago

Hi @Jennie17, I have the same tblastn version and I use protein sequences for training, so I do not have this issue. emapper.py and ete3 are easy to install via bioconda.

Jennie17 commented 4 years ago

Hi @sklcusa,

Thanks for your comments.

I don't mind downgrading tblastn and giving it a try.

Do you know how to do it via conda?

For the current failing situation, is there something else I could try to make it work?

sklcusa commented 4 years ago

@Jennie17, you are welcome. Just search Google for "bioconda install ete3" or "emapper"; there are lots of instructions. I suggest creating separate virtual environments with "conda create -n envs_name" before running "conda install basename". If you want to change tblastn, download an older binary, find where tblastn is located on your computer, and replace the file. That is the fast and easy way.

nextgenusfs commented 4 years ago

Hi @Jennie17, your version of tblastn is fine: it is an older release in which multithreading works. Are you able to run the test datasets to completion on your system? For example, funannotate test -t busco --cpus X should test your install to make sure everything is working. It runs BUSCO-mediated training on a smaller dataset, which seems to be what you are trying to run.

The missing dependencies you listed are optional.
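For anyone who wants to pull the missing items out of a long `funannotate check` dump, a small sketch like the one below may help. The helper name is hypothetical, and the two patterns are assumptions based only on the ERROR lines pasted in this thread; other versions may word these messages differently.

```python
import re

# Patterns copied from the ERROR lines in the `funannotate check` output above.
NOT_INSTALLED = re.compile(r"ERROR: (\S+) not installed")
NOT_SET = re.compile(r"ERROR: (\S+) not set")


def missing_items(check_output):
    """Split the ERROR lines into missing tools and unset environment variables."""
    return {
        "tools": NOT_INSTALLED.findall(check_output),
        "env": NOT_SET.findall(check_output),
    }


# Excerpt of the check output from the first comment.
sample = (
    "ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir\n"
    "trimmomatic: 0.39 ERROR: emapper.py not installed ERROR: ete3 not installed "
    "ERROR: gmes_petap.pl not installed ERROR: signalp not installed"
)
print(missing_items(sample))
```

The sketch only lists what is reported as missing; whether a given item is required or optional still has to come from the funannotate documentation or the maintainer.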

Jennie17 commented 4 years ago

Hi @nextgenusfs,

I've tried the command you suggested and the output is as below.

I noticed one message saying that SNAP failed:

[02:38 PM]: SNAP prediction failed, moving on without result

Could this be causing the problem? It also occurs toward the end of the BUSCO prediction, like the failure I see with my real data.

$ funannotate test -t busco
#########################################################
Running funannotate predict BUSCO-mediated training unit testing
Downloading: https://osf.io/te2pf/download?version=1
Bytes: 1489808
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 2 --species Awesome busco
#########################################################

[02:14 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[02:14 PM]: Running funannotate v1.7.4
[02:14 PM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[02:14 PM]: Parsed training data, run ab-initio gene predictors as follows:
Program Training-Method
augustus busco
glimmerhmm busco
snap busco
[02:14 PM]: CodingQuarry will be skipped --> --rna_bam required for training
[02:14 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[02:14 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[02:14 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[02:15 PM]: Found 1,774 preliminary alignments --> aligning with exonerate
[02:15 PM]: Exonerate finished: found 1,345 alignments
[02:15 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[02:34 PM]: 373 valid BUSCO predictions found, now formatting for EVM
[02:34 PM]: Running EVM commands with 1 CPUs
[02:35 PM]: Converting to GFF3 and collecting all EVM results
[02:35 PM]: 365 total gene models from EVM, now validating with BUSCO HMM search
[02:36 PM]: 365 BUSCO predictions validated
[02:36 PM]: Training Augustus using BUSCO gene models
[02:36 PM]: Augustus initial training results:
Feature Specificity Sensitivity
nucleotides 99.3% 84.0%
exons 69.6% 56.5%
genes 82.0% 55.8%
[02:36 PM]: Running Augustus gene prediction using awesome_busco parameters
[02:38 PM]: 1,393 predictions from Augustus
[02:38 PM]: Pulling out high quality Augustus predictions
[02:38 PM]: Found 320 high quality predictions from Augustus (>90% exon evidence)
[02:38 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[02:38 PM]: 0 predictions from SNAP
[02:38 PM]: SNAP prediction failed, moving on without result
[02:38 PM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[02:39 PM]: 1,773 predictions from GlimmerHMM
[02:39 PM]: Summary of gene models passed to EVM (weights):
Source Weight Count
Augustus 1 1073
Augustus HiQ 2 320
GlimmerHMM 1 1773
Total - 3166
[02:40 PM]: Running EVM commands with 1 CPUs
[02:44 PM]: Converting to GFF3 and collecting all EVM results
[02:44 PM]: 1,705 total gene models from EVM
[02:44 PM]: Generating protein fasta files from 1,705 EVM models
[02:44 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[02:44 PM]: Found 154 gene models to remove: 0 too short; 0 span gaps; 224 transposable elements
[02:44 PM]: 1,551 gene models remaining
[02:44 PM]: Predicting tRNAs
[02:45 PM]: 105 tRNAscan models are valid (non-overlapping)
[02:45 PM]: Generating GenBank tbl annotation file
[02:45 PM]: Converting to final Genbank format
[02:45 PM]: Collecting final annotation files for 1,656 total gene models
[02:45 PM]: Funannotate predict is finished, output files are in the annotate/predict_results folder
[02:45 PM]: Your next step might be functional annotation, suggested commands:


Run InterProScan (Docker required):
funannotate iprscan -i annotate -m docker -c 2

Run antiSMASH: funannotate remote -i annotate -m antismash -e youremail@server.edu

Annotate Genome: funannotate annotate -i annotate --cpus 2 --sbt yourSBTfile.txt

[02:45 PM]: Training parameters file saved: annotate/predict_results/awesome_busco.parameters.json
[02:45 PM]: Add species parameters to database:

funannotate species -s awesome_busco -a annotate/predict_results/awesome_busco.parameters.json

#########################################################
SUCCESS: funannotate predict BUSCO-mediated training test complete.
#########################################################
Now running predict using all pre-trained ab-initio predictors
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate2 --cpus 2 --species Awesome busco -p annotate/predict_results/awesome_busco.parameters.json
#########################################################

[02:45 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[02:45 PM]: Running funannotate v1.7.4
[02:45 PM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[02:46 PM]: Ab initio training parameters file passed: annotate/predict_results/awesome_busco.parameters.json
[02:46 PM]: Parsed training data, run ab-initio gene predictors as follows:
Program Training-Method
augustus pretrained
glimmerhmm pretrained
snap busco
[02:46 PM]: CodingQuarry will be skipped --> --rna_bam required for training
[02:46 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[02:46 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[02:46 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[02:46 PM]: Found 1,774 preliminary alignments --> aligning with exonerate
[02:47 PM]: Exonerate finished: found 1,345 alignments
[02:47 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[03:04 PM]: 373 valid BUSCO predictions found, now formatting for EVM
[03:05 PM]: Running EVM commands with 1 CPUs
[03:06 PM]: Converting to GFF3 and collecting all EVM results
[03:06 PM]: 365 total gene models from EVM, now validating with BUSCO HMM search
[03:06 PM]: 365 BUSCO predictions validated
[03:06 PM]: Running Augustus gene prediction using awesome_busco parameters
[03:08 PM]: 1,393 predictions from Augustus
[03:08 PM]: Pulling out high quality Augustus predictions
[03:08 PM]: Found 320 high quality predictions from Augustus (>90% exon evidence)
[03:08 PM]: Running SNAP gene prediction, using training data: annotate2/predict_misc/busco.final.gff3
[03:08 PM]: 0 predictions from SNAP
[03:08 PM]: SNAP prediction failed, moving on without result
[03:08 PM]: Running GlimmerHMM gene prediction, using pretrained HMM profile
[03:08 PM]: 1,773 predictions from GlimmerHMM
[03:08 PM]: Summary of gene models passed to EVM (weights):
Source Weight Count
Augustus 1 1073
Augustus HiQ 2 320
GlimmerHMM 1 1773
Total - 3166
[03:08 PM]: Running EVM commands with 1 CPUs
[03:13 PM]: Converting to GFF3 and collecting all EVM results
[03:13 PM]: 1,704 total gene models from EVM
[03:13 PM]: Generating protein fasta files from 1,704 EVM models
[03:13 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[03:13 PM]: Found 154 gene models to remove: 0 too short; 0 span gaps; 224 transposable elements
[03:13 PM]: 1,550 gene models remaining
[03:13 PM]: Predicting tRNAs
[03:14 PM]: 106 tRNAscan models are valid (non-overlapping)
[03:14 PM]: Generating GenBank tbl annotation file
[03:14 PM]: Converting to final Genbank format
[03:14 PM]: Collecting final annotation files for 1,656 total gene models
[03:14 PM]: Funannotate predict is finished, output files are in the annotate2/predict_results folder
[03:14 PM]: Your next step might be functional annotation, suggested commands:

Run InterProScan (Docker required): funannotate iprscan -i annotate2 -m docker -c 2

Run antiSMASH: funannotate remote -i annotate2 -m antismash -e youremail@server.edu

Annotate Genome: funannotate annotate -i annotate2 --cpus 2 --sbt yourSBTfile.txt

[03:14 PM]: Training parameters file saved: annotate2/predict_results/awesome_busco.parameters.json
[03:14 PM]: Add species parameters to database:

funannotate species -s awesome_busco -a annotate2/predict_results/awesome_busco.parameters.json

#########################################################
SUCCESS: funannotate predict using existing parameters test complete.
#########################################################
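To spot a predictor that returned nothing in a long predict log like the one above, the per-tool counts can be tabulated. This is a minimal sketch with hypothetical function names; the pattern assumes the "N predictions from TOOL" wording seen in this v1.7.4 output.

```python
import re

# Count lines as printed above, e.g. "[02:38 PM]: 1,393 predictions from Augustus".
# The number must sit directly before "predictions from", so lines such as
# "Found 320 high quality predictions from Augustus" are deliberately not matched.
COUNT = re.compile(r"(\d[\d,]*) predictions from (\w+)")


def predictor_counts(log_text):
    """Map each ab-initio predictor named in the log to its reported prediction count."""
    counts = {}
    for number, tool in COUNT.findall(log_text):
        counts[tool] = int(number.replace(",", ""))
    return counts


def empty_predictors(log_text):
    """Return, in sorted order, the predictors that reported zero predictions."""
    return sorted(tool for tool, n in predictor_counts(log_text).items() if n == 0)


# Lines copied from the test run above.
sample = (
    "[02:38 PM]: 1,393 predictions from Augustus\n"
    "[02:38 PM]: Found 320 high quality predictions from Augustus (>90% exon evidence)\n"
    "[02:38 PM]: 0 predictions from SNAP\n"
    "[02:39 PM]: 1,773 predictions from GlimmerHMM\n"
)
print(predictor_counts(sample), empty_predictors(sample))
```

In the test run above only SNAP comes back empty, and the log shows funannotate moving on without it and finishing both tests.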

Jennie17 commented 4 years ago

Hi @sklcusa,

Thanks for your suggestion.