zctea closed this issue 6 years ago
It sounds like RepeatMasker or RepeatModeler is not properly installed. Check the log files for clues as to why.
Thanks for your quick response!
I installed RepBase before running configure. RepeatMasker (4.0.7) generated an empty RepeatMasker.lib file. This appears to be the same issue as https://github.com/rmhubley/RepeatModeler/issues/10
Congratulations! RepeatMasker is now ready to use.
The program is installed with a the following repeat libraries:
Dfam database version Dfam_2.0
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
Further documentation on the program may be found here:
/home/zctea/biosoft/RepeatMasker/repeatmasker.help
WARNING: /home/zctea/biosoft/RepeatMasker/Libraries/RepeatMasker.lib.[n??] doesn't exist!
RepeatModeler will not run correctly without these files.
Please re-run /home/zctea/biosoft/RepeatMasker/configure to create these files automatically. Then re-run this script.
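The warning refers to the BLAST index files that makeblastdb writes next to RepeatMasker.lib. A minimal sketch for checking whether they exist follows. It assumes a nucleotide database, whose index files normally end in .nhr, .nin and .nsq, and the helper name is hypothetical (it is not part of RepeatMasker or RepeatModeler).

```shell
# Hypothetical helper: succeeds if at least one RepeatMasker.lib.n?? file
# exists in the given Libraries directory.
has_rm_blast_db() {
    for f in "$1"/RepeatMasker.lib.n??; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage (path taken from the warning above):
# has_rm_blast_db /home/zctea/biosoft/RepeatMasker/Libraries \
#     || echo "BLAST database files are missing"
```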
I tried to use the buildRMLibFromEMBL.pl script in the util directory of RepeatMasker to convert RepeatMaskerLib.embl to a FASTA-format RepeatMasker.lib, and then ran makeblastdb on the RepeatMasker.lib file.
buildRMLibFromEMBL.pl RepeatMaskerLib.embl > RepeatMasker.lib
makeblastdb -in RepeatMasker.lib -out RepeatMasker.lib -dbtype nucl -parse_seqids
However, it failed with parser errors:
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474091: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474092: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474093: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474094: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474095: 1-19
FASTA-Reader: Ignoring invalid residues at position(s): On line 2488736: 4
Adding sequences from FASTA; added 45447 sequences in 4.1424 seconds.
Relevant issue: https://www.biostars.org/p/63444/ https://www.biostars.org/p/202806/
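makeblastdb reports the offending line numbers, so one way to see what is on those lines is to scan the FASTA file for sequence lines that contain characters outside the IUPAC nucleotide alphabet. This is a rough sketch and not what makeblastdb does internally; the function name is made up for illustration.

```shell
# Hypothetical helper: print "line_number: content" for every sequence line
# (any line not starting with '>') that contains a character outside the
# IUPAC nucleotide codes.
find_invalid_fasta_lines() {
    awk '!/^>/ && /[^ACGTURYSWKMBDHVNacgturyswkmbdhvn]/ { print NR ": " $0 }' "$1"
}

# Usage:
# find_invalid_fasta_lines RepeatMasker.lib | head
```

If the reported lines turn out to hold EMBL annotation text rather than sequence, the conversion step is the more likely culprit than the database itself.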
Did you try re-downloading the RepBase database, untarring it, moving it into the Libraries folder of RepeatMasker, and then re-running the configure script from the RepeatMasker base directory with rmblast specified as the default? Other than that, I don't know what else it could be. What version of BLAST are you running? It does use makeblastdb from the blast+ package.
Thank you for your kind help!
You are right: I specified HMMER3.1 & DFAM as the default while configuring RepeatMasker, and that caused the error.
I ran makeblastdb (BLAST 2.2.31+) on RepeatMasker.lib. Although the makeblastdb process printed the parser errors, RepeatModeler still installed fine after this.
I ran the following command:
funannotate predict -i TAIR10_genome.fasta --species "Arabidopsis thaliana" --organism other --busco_db embryophyta --busco_seed_species arabidopsis --cpus 28 --transcript_evidence TAIR10_cdna.fasta -o predict
The following are the log and output files from the funannotate predict run. Why did Evidence Modeler fail?
[05/26/18 16:44:28]: /home/zctea/biosoft/funannotate/bin/funannotate-predict.py -i TAIR10_genome.fasta --species Arabidopsis thaliana --organism other --busco_db embryophyta --busco_seed_species arabidopsis --cpus 28 --transcript_evidence TAIR10_cdna.fasta -o predict
[05/26/18 16:44:28]: OS: linux2, 32 cores, ~ 182 GB RAM. Python: 2.7.15
[05/26/18 16:44:28]: Running funannotate v1.3.3
[05/26/18 16:44:30]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER and BUSCO
[05/26/18 16:44:43]: Masked genome: 5 scaffolds; 119,146,348 bp; 14.85% repeats masked
[05/26/18 16:44:43]: Existing transcript alignments found: predict/predict_misc/transcript_alignments.gff3
[05/26/18 16:44:58]: Existing Exonerate alignments found: predict/predict_misc/exonerate.out
[05/26/18 16:45:00]: /home/zctea/biosoft/augustus/scripts/exonerate2hints.pl --in=predict/predict_misc/exonerate.out --out=predict/predict_misc/hints.P.gff --minintronlen=10 --maxintronlen=3000
[05/26/18 16:45:20]: perl /home/zctea/biosoft/augustus/scripts/join_mult_hints.pl
[05/26/18 16:45:22]: Running GeneMark-ES on assembly
[05/26/18 16:45:22]: gmes_petap.pl --ES --max_intron 3000 --soft_mask 5000 --cores 28 --sequence /home/zctea/biowork/TAIR10/predict/predict_misc/genome.softmasked.fa
[05/26/18 17:01:14]: (None, '')
[05/26/18 17:01:14]: Converting GeneMark GTF file to GFF3
[05/26/18 17:01:19]: perl /home/zctea/biosoft/evidencemodeler/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl predict/predict_misc/genemark.gff
[05/26/18 17:01:19]: Found 30,705 gene models
[05/26/18 17:01:19]: Running BUSCO to find conserved gene models for training Augustus
[05/26/18 17:32:26]: 1,401 valid BUSCO predictions found, now formatting for EVM
[05/26/18 17:33:22]: /home/zctea/biosoft/funannotate/util/fix_busco_naming.py predict/predict_misc/busco_augustus.tmp predict/predict_misc/busco/run_arabidopsis_thaliana/full_table_arabidopsis_thaliana.tsv predict/predict_misc/busco_augustus.gff3
[05/26/18 17:33:22]: bedtools intersect -a predict/predict_misc/genemark.evm.gff3 -b predict/predict_misc/buscos.bed
[05/26/18 17:33:22]: bedtools intersect -a /home/zctea/biowork/TAIR10/predict/predict_misc/transcript_alignments.gff3 -b predict/predict_misc/buscos.bed
[05/26/18 17:33:22]: bedtools intersect -a /home/zctea/biowork/TAIR10/predict/predict_misc/protein_alignments.gff3 -b predict/predict_misc/buscos.bed
[05/26/18 17:35:56]: Evidence modeler has failed, exiting
-----------------------
 
zctea@zctea ~/b/TAIR10> funannotate predict -i TAIR10_genome.fasta --species "Arabidopsis thaliana" --organism other --busco_db embryophyta --busco_seed_species arabidopsis --cpus 28 --transcript_evidence TAIR10_cdna.fasta -o predict
-------------------------------------------------------
[08:31 PM]: OS: linux2, 32 cores, ~ 182 GB RAM. Python: 2.7.15
[08:31 PM]: Running funannotate v1.3.3
[08:31 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER and BUSCO
[08:32 PM]: Masked genome: 5 scaffolds; 119,146,348 bp; 14.85% repeats masked
[08:32 PM]: Existing transcript alignments found: predict/predict_misc/transcript_alignments.gff3
[08:32 PM]: Existing Exonerate alignments found: predict/predict_misc/exonerate.out
[08:32 PM]: Existing GeneMark annotation found: predict/predict_misc/genemark.gff
[08:32 PM]: Found 30,705 gene models
[08:32 PM]: Running BUSCO to find conserved gene models for training Augustus
[09:03 PM]: 1,401 valid BUSCO predictions found, now formatting for EVM
[09:04 PM]: Setting up EVM partitions
[09:06 PM]: Generating EVM command list
[09:06 PM]: Running EVM commands with 27 CPUs
[09:07 PM]: Combining partitioned EVM outputs
[09:07 PM]: Converting EVM output to GFF3
[09:07 PM]: Collecting all EVM results
[09:07 PM]: Evidence modeler has failed, exiting
-----------------
 
drwxr-xr-x 4 zctea zctea 4.0K 5月 26 17:01 busco/
-rw-r--r-- 1 zctea zctea 0 5月 26 17:33 busco_augustus.gff3
-rw-r--r-- 1 zctea zctea 0 5月 26 17:32 busco_augustus.tmp
-rw-r--r-- 1 zctea zctea 0 5月 26 17:33 busco_genemark.gff3
-rw-r--r-- 1 zctea zctea 0 5月 26 17:33 busco_predictions.gff3
-rw-r--r-- 1 zctea zctea 0 5月 26 17:33 busco_proteins.gff3
-rw-r--r-- 1 zctea zctea 48K 5月 26 17:32 buscos.bed
-rw-r--r-- 1 zctea zctea 0 5月 26 17:33 busco_transcripts.gff3
-rw-r--r-- 1 zctea zctea 99 5月 26 17:33 busco_weights.txt
-rw-r--r-- 1 zctea zctea 30M 5月 26 00:22 exonerate.out
drwxr-xr-x 6 zctea zctea 4.0K 5月 26 17:01 genemark/
-rw-r--r-- 1 zctea zctea 0 5月 26 17:01 genemark.evm.gff3
-rw-r--r-- 1 zctea zctea 0 5月 26 17:01 genemark.evm.gff3.bak
-rw-r--r-- 1 zctea zctea 36M 5月 26 17:01 genemark.gff
-rw-r--r-- 1 zctea zctea 0 5月 26 17:01 genemark.temp.gff
-rw-r--r-- 1 zctea zctea 116M 5月 26 16:44 genome.fasta
-rw-r--r-- 1 zctea zctea 116M 5月 25 22:05 genome.softmasked.fa
-rw-r--r-- 1 zctea zctea 898K 5月 26 16:59 gmhmm.mod
-rw-r--r-- 1 zctea zctea 14M 5月 26 16:45 hints.ALL.gff
-rw-r--r-- 1 zctea zctea 24M 5月 26 16:45 hints.all.sort.tmp
-rw-r--r-- 1 zctea zctea 24M 5月 26 16:45 hints.all.tmp
-rw-r--r-- 1 zctea zctea 17M 5月 25 22:05 hints.M.gff
-rw-r--r-- 1 zctea zctea 6.8M 5月 26 16:45 hints.P.gff
-rw-r--r-- 1 zctea zctea 62M 5月 26 00:22 p2g.diamond.out
drwxrwxr-x 5 zctea zctea 4.0K 5月 26 15:39 predict/
-rw-r--r-- 1 zctea zctea 6.3M 5月 26 16:45 protein_alignments.gff3
-rw-r--r-- 1 zctea zctea 6.3M 5月 26 16:44 protein_alignments.gff3.old
-rw-r--r-- 1 zctea zctea 198M 5月 26 16:44 proteins.combined.fa
-rw-r--r-- 1 zctea zctea 198M 5月 26 16:44 proteins.combined.fa.old
drwxr-xr-x 2 zctea zctea 4.0K 5月 25 22:04 RepeatMasker/
-rw-r--r-- 1 zctea zctea 0 5月 25 22:05 repeatmasker.gff3
drwxr-xr-x 3 zctea zctea 36K 5月 25 21:55 RepeatModeler/
-rw-r--r-- 1 zctea zctea 553K 5月 25 21:55 repeatmodeler.lib.fa
-rw-r--r-- 1 zctea zctea 29 5月 26 16:44 scaffold.sort.order.txt
-rw-r--r-- 1 zctea zctea 40 5月 26 16:44 scaffold.sort.rename.txt
-rw-r--r-- 1 zctea zctea 12M 5月 25 22:05 transcript_alignments.gff3
-rw-r--r-- 1 zctea zctea 12M 5月 26 16:44 transcript_alignments.gff3.old
-rw-r--r-- 1 zctea zctea 12M 5月 25 22:05 transcript_minimap2.gff3
-rw-r--r-- 1 zctea zctea 89M 5月 25 22:05 transcripts.combined.fa
-rw-r--r-- 1 zctea zctea 24M 5月 25 22:05 transcripts.minimap2.bam
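Several intermediate files in this listing are zero bytes, including genemark.evm.gff3 even though genemark.gff is 36M, which suggests that a conversion step wrote no output. A quick way to list every empty file in the directory is sketched below; the helper name is hypothetical, and the path in the usage line assumes the -o predict layout used above.

```shell
# Hypothetical helper: list zero-byte regular files directly inside a directory.
list_empty_files() {
    find "$1" -maxdepth 1 -type f -size 0 | sort
}

# Usage:
# list_empty_files predict/predict_misc
```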
I am a beginner in bioinformatics. I would appreciate some guidance on how to adjust the options so that the pipeline fits non-model plant genome annotation.
Taking Camellia sinensis as an example:
funannotate predict \
-i genome.fasta \
-o predict \
--species "Camellia sinensis" \
--organism other \
--busco_db embryophyta \
--busco_seed_species ?*** (non-model plant) \
--augustus_species ?*** (non-model plant) \
--optimize_augustus ?*** \
--transcript_evidence trinity.fasta \
--rna_bam alignments.bam \
--protein_evidence uniprot.fa \
--cpus 28
Did you check the EVM log file? Perhaps a Perl dependency is missing? Generally your command looks good; a few suggestions:
You can check which species are available (pre-trained) using the funannotate species command.
$ funannotate species
--------------------------
AUGUSTUS species options:
--------------------------
Conidiobolus_coronatus cryptococcus pfalciparum
E_coli_K12 cryptococcus_neoformans_gattii phanerochaete_chrysosporium
Xipophorus_maculatus cryptococcus_neoformans_neoformans_B pichia_stipitis
adorsata cryptococcus_neoformans_neoformans_JEC21 pneumocystis
aedes culex rhizopus_oryzae
amphimedon debaryomyces_hansenii rhodnius
ancylostoma_ceylanicum elegans rice
anidulans elephant_shark rubeus_macgubis
arabidopsis encephalitozoon_cuniculi_GB s_aureus
aspergillus_fumigatus eremothecium_gossypii s_pneumoniae
aspergillus_nidulans fly saccharomyces
aspergillus_oryzae fusarium saccharomyces_cerevisiae_S288C
aspergillus_terreus fusarium_graminearum saccharomyces_cerevisiae_rm11-1a_1
b_pseudomallei galdieria schistosoma
bombus_impatiens1 generic schistosoma2
bombus_terrestris2 heliconius_melpomene1 schizosaccharomyces_pombe
botrytis_cinerea histoplasma seahare
brugia histoplasma_capsulatum sulfolobus_solfataricus
c_elegans_trsk honeybee1 template_prokaryotic
cacao human tetrahymena
caenorhabditis kluyveromyces_lactis thermoanaerobacter_tengcongensis
camponotus_floridanus laccaria_bicolor tomato
candida_albicans lamprey toxoplasma
candida_guilliermondii leishmania_tarentolae tribolium2012
candida_tropicalis lodderomyces_elongisporus trichinella
chaetomium_globosum magnaporthe_grisea ustilago
chicken maize ustilago_maydis
chlamy2011 maize5 verticillium_albo_atrum1
chlamydomonas nasonia verticillium_longisporum1
chlorella neurospora wheat
coccidioides_immitis neurospora_crassa yarrowia_lipolytica
coprinus parasteatoda zebrafish
coprinus_cinereus pchrysosporium
coyote_tobacco pea_aphid
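To check a candidate name against this list without scanning it by eye, the output can be split into one name per line and matched exactly. The sketch below assumes the listing format shown above (dashed separator lines, one header line containing a colon, then whitespace-separated names); the helper name is hypothetical.

```shell
# Hypothetical helper: reads a species listing on stdin and succeeds if the
# given name appears in it as a whole word.
species_available() {
    # Drop separator lines and the header line, then put one name per line.
    grep -v -e '^-' -e ':' | tr -s '[:space:]' '\n' | grep -Fxq -- "$1"
}

# Usage:
# funannotate species | species_available arabidopsis && echo "pre-trained"
```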
So you can use --busco_seed_species arabidopsis or perhaps a more closely related plant genome (if one exists). Leave the --augustus_species option off the command; funannotate will then train Augustus for your genome automatically, using the --busco_seed_species parameter to run BUSCO and then using those gene models for training. You may also want to raise the --max_intronlen parameter (the default of 3000, visible in your log, is tuned for fungi).
If you have RNA-seq data, you can also use the funannotate train command, which will run genome-guided Trinity and PASA for you.
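Putting these suggestions together for the Camellia sinensis example might look like the sketch below. It only prints the command rather than running it. It assumes that the option for raising intron size is --max_intronlen (the logs above show a maximum intron length of 3000), and the value 10000 is a placeholder, not a recommendation from this thread. The file names are the ones from the question, and the function name is hypothetical.

```shell
# Sketch only: prints a candidate command instead of running it.
# MAX_INTRON defaults to a placeholder value; choose one suited to your genome.
build_predict_cmd() {
    echo "funannotate predict" \
        "-i genome.fasta -o predict" \
        "--species \"Camellia sinensis\" --organism other" \
        "--busco_db embryophyta --busco_seed_species arabidopsis" \
        "--max_intronlen ${MAX_INTRON:-10000}" \
        "--transcript_evidence trinity.fasta --rna_bam alignments.bam" \
        "--protein_evidence uniprot.fa --cpus 28"
}

# Usage: inspect the command first, then run it by hand.
# build_predict_cmd
```

Note that --augustus_species is deliberately absent, so that Augustus is trained from the BUSCO models as described above.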
I am an absolute beginner in bioinformatics. How can I solve this problem?
zctea@zctea:~/biowork/TAIR10$ funannotate predict -i TAIR10_genome.fasta --species "Arabidopsis thaliana" --busco_db embryophyta --cpus 28 --transcript_evidence TAIR10_cdna.fasta -o predict