nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

UnboundLocalError: local variable 'RP_folder' referenced before assignment #168

Closed zctea closed 6 years ago

zctea commented 6 years ago

I am an absolute beginner in bioinformatics. How can I solve this problem?

zctea@zctea:~/biowork/TAIR10$ funannotate predict -i TAIR10_genome.fasta --species "Arabidopsis thaliana" --busco_db embryophyta --cpus 28 --transcript_evidence TAIR10_cdna.fasta -o predict


[09:14 AM]: OS: linux2, 32 cores, ~ 182 GB RAM. Python: 2.7.15
[09:14 AM]: Running funannotate v1.3.3
[09:15 AM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER and BUSCO
[09:15 AM]: Loading sequences and soft-masking genome
[09:15 AM]: Soft-masking: building RepeatModeler database
[09:15 AM]: Soft-masking: generating repeat library using RepeatModeler
Traceback (most recent call last):
  File "/home/zctea/biosoft/funannotate/bin/funannotate-predict.py", line 367, in <module>
    lib.RepeatModelMask(Genome, args.cpus, os.path.join(args.out, 'predict_misc'), MaskGenome, debug)
  File "/home/zctea/biosoft/funannotate/lib/library.py", line 3792, in RepeatModelMask
    os.rename(os.path.join(outdir, RP_folder, 'consensi.fa.classified'), library)
UnboundLocalError: local variable 'RP_folder' referenced before assignment
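
For readers puzzled by the traceback: an UnboundLocalError like this usually means a variable is assigned only on some code path (for example, only when an expected directory is found) and then used unconditionally. The sketch below is not funannotate's code. The function name `find_repeatmodeler_dir` and the `RM_` directory prefix are assumptions made for illustration. It shows how such a lookup could fail with a descriptive message instead.

```python
import os


def find_repeatmodeler_dir(outdir, prefix="RM_"):
    """Return the name of a RepeatModeler-style working directory in outdir.

    The "RM_" prefix is an assumption for illustration. If nothing matches,
    raise a clear error instead of leaving a variable unassigned.
    """
    candidates = sorted(
        name
        for name in os.listdir(outdir)
        if name.startswith(prefix) and os.path.isdir(os.path.join(outdir, name))
    )
    if not candidates:
        raise RuntimeError(
            f"No {prefix}* directory found in {outdir}; "
            "the repeat-modelling step probably did not run. Check its logs."
        )
    # If several runs exist, pick the last one in sorted order.
    return candidates[-1]
```

With a guard like this, a failed RepeatModeler run would surface as a message pointing at the missing output rather than as an unassigned variable.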


nextgenusfs commented 6 years ago

Sounds like RepeatMasker or RepeatModeler is not properly installed. Check the logfiles for clues as to why.

zctea commented 6 years ago

Thanks for your quick response!

I installed RepBase before running configure. RepeatMasker (4.0.7) generated an empty RepeatMasker.lib file. This appears to be the same issue as https://github.com/rmhubley/RepeatModeler/issues/10

RepeatMasker configure

Congratulations! RepeatMasker is now ready to use.
The program is installed with a the following repeat libraries:
  Dfam database version Dfam_2.0
  RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
Further documentation on the program may be found here:
  /home/zctea/biosoft/RepeatMasker/repeatmasker.help

RepeatModeler configure

WARNING: /home/zctea/biosoft/RepeatMasker/Libraries/RepeatMasker.lib.[n??] doesn't exist! RepeatModeler will not run correctly without these files. Please re-run /home/zctea/biosoft/RepeatMasker/configure to create these files automatically. Then re-run this script.

I tried using the buildRMLibFromEMBL.pl script in RepeatMasker's util directory to convert RepeatMaskerLib.embl into a FASTA-format RepeatMasker.lib, and then ran makeblastdb on the RepeatMasker.lib file.

buildRMLibFromEMBL.pl RepeatMaskerLib.embl > RepeatMasker.lib
makeblastdb -in RepeatMasker.lib -out RepeatMasker.lib -dbtype nucl -parse_seqids

However, makeblastdb reported parser errors:

FASTA-Reader: Ignoring invalid residues at position(s): On line 2474091: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474092: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474093: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474094: 1-50
FASTA-Reader: Ignoring invalid residues at position(s): On line 2474095: 1-19
FASTA-Reader: Ignoring invalid residues at position(s): On line 2488736: 4
Adding sequences from FASTA; added 45447 sequences in 4.1424 seconds.
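
To see what the FASTA reader is complaining about, one option is to scan the library for sequence lines containing characters outside the IUPAC nucleotide alphabet. This is a minimal sketch, not part of RepeatMasker or BLAST+. The function name `find_invalid_fasta_lines` is hypothetical, and the accepted alphabet is an assumption about what counts as valid.

```python
import re

# IUPAC nucleotide codes (either case) plus the gap character "-".
# Any sequence line containing other characters is flagged.
VALID_SEQ = re.compile(r"^[ACGTUNRYKMSWBDHVacgtunrykmswbdhv-]*$")


def find_invalid_fasta_lines(lines):
    """Yield (line_number, text) for sequence lines with unexpected characters."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith(">"):
            continue  # header lines may legitimately contain any characters
        if not VALID_SEQ.match(text):
            yield number, text


if __name__ == "__main__":
    # Print the first 60 characters of each suspicious line.
    with open("RepeatMasker.lib") as handle:
        for number, text in find_invalid_fasta_lines(handle):
            print(number, text[:60])
```

The reported line numbers can then be compared with those in the makeblastdb messages, for example line 2474091.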

Relevant issues:
https://www.biostars.org/p/63444/
https://www.biostars.org/p/202806/

nextgenusfs commented 6 years ago

Did you try re-downloading the RepBase database, untarring it, moving it into the Libraries folder of RepeatMasker, and then re-running the configure script from the RepeatMasker base directory with rmblast specified as the default? Other than that, I don't know what else it could be. What version of BLAST are you running? It does use makeblastdb from the BLAST+ package.
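
A quick way to answer the version question is to ask the tool itself. The sketch below uses a hypothetical helper name, `tool_version`, and assumes the tool accepts a `-version` flag, as BLAST+ programs do. It returns None when the executable is not on PATH.

```python
import shutil
import subprocess


def tool_version(tool, flag="-version"):
    """Return the version text a command-line tool prints, or None if not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run(
        [path, flag],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print version info to stderr
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(tool_version("makeblastdb") or "makeblastdb not found on PATH")
```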

zctea commented 6 years ago

Thank you for your kind help!

You are right: I specified HMMER3.1 & DFAM as the default when configuring RepeatMasker, and that caused the error.

I ran makeblastdb (BLAST 2.2.31+) on RepeatMasker.lib. Although makeblastdb printed the parser errors, RepeatModeler installed fine after this.

I then ran the following command:

funannotate predict -i TAIR10_genome.fasta --species "Arabidopsis thaliana"  --organism other  --busco_db embryophyta --busco_seed_species arabidopsis --cpus 28  --transcript_evidence TAIR10_cdna.fasta -o predict

Below are the log and the output files from the funannotate predict run. Why did Evidence Modeler fail?

[05/26/18 16:44:28]: /home/zctea/biosoft/funannotate/bin/funannotate-predict.py -i TAIR10_genome.fasta --species Arabidopsis thaliana --organism other --busco_db embryophyta --busco_seed_species arabidopsis --cpus 28 --transcript_evidence TAIR10_cdna.fasta -o predict

[05/26/18 16:44:28]: OS: linux2, 32 cores, ~ 182 GB RAM. Python: 2.7.15
[05/26/18 16:44:28]: Running funannotate v1.3.3
[05/26/18 16:44:30]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER and BUSCO
[05/26/18 16:44:43]: Masked genome: 5 scaffolds; 119,146,348 bp; 14.85% repeats masked
[05/26/18 16:44:43]: Existing transcript alignments found: predict/predict_misc/transcript_alignments.gff3
[05/26/18 16:44:58]: Existing Exonerate alignments found: predict/predict_misc/exonerate.out
[05/26/18 16:45:00]: /home/zctea/biosoft/augustus/scripts/exonerate2hints.pl --in=predict/predict_misc/exonerate.out --out=predict/predict_misc/hints.P.gff --minintronlen=10 --maxintronlen=3000
[05/26/18 16:45:20]: perl /home/zctea/biosoft/augustus/scripts/join_mult_hints.pl
[05/26/18 16:45:22]: Running GeneMark-ES on assembly
[05/26/18 16:45:22]: gmes_petap.pl --ES --max_intron 3000 --soft_mask 5000 --cores 28 --sequence /home/zctea/biowork/TAIR10/predict/predict_misc/genome.softmasked.fa
[05/26/18 17:01:14]: (None, '')
[05/26/18 17:01:14]: Converting GeneMark GTF file to GFF3
[05/26/18 17:01:19]: perl /home/zctea/biosoft/evidencemodeler/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl predict/predict_misc/genemark.gff
[05/26/18 17:01:19]: Found 30,705 gene models
[05/26/18 17:01:19]: Running BUSCO to find conserved gene models for training Augustus
[05/26/18 17:32:26]: 1,401 valid BUSCO predictions found, now formatting for EVM
[05/26/18 17:33:22]: /home/zctea/biosoft/funannotate/util/fix_busco_naming.py predict/predict_misc/busco_augustus.tmp predict/predict_misc/busco/run_arabidopsis_thaliana/full_table_arabidopsis_thaliana.tsv predict/predict_misc/busco_augustus.gff3
[05/26/18 17:33:22]: bedtools intersect -a predict/predict_misc/genemark.evm.gff3 -b predict/predict_misc/buscos.bed
[05/26/18 17:33:22]: bedtools intersect -a /home/zctea/biowork/TAIR10/predict/predict_misc/transcript_alignments.gff3 -b predict/predict_misc/buscos.bed
[05/26/18 17:33:22]: bedtools intersect -a /home/zctea/biowork/TAIR10/predict/predict_misc/protein_alignments.gff3 -b predict/predict_misc/buscos.bed
[05/26/18 17:35:56]: Evidence modeler has failed, exiting

-----------------------
 

zctea@zctea ~/b/TAIR10> funannotate predict -i TAIR10_genome.fasta --species "Arabidopsis thaliana"  --organism other  --busco_db embryophyta --busco_seed_species arabidopsis --cpus 28  --transcript_evidence TAIR10_cdna.fasta -o predict 
-------------------------------------------------------
[08:31 PM]: OS: linux2, 32 cores, ~ 182 GB RAM. Python: 2.7.15
[08:31 PM]: Running funannotate v1.3.3
[08:31 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER and BUSCO
[08:32 PM]: Masked genome: 5 scaffolds; 119,146,348 bp; 14.85% repeats masked
[08:32 PM]: Existing transcript alignments found: predict/predict_misc/transcript_alignments.gff3
[08:32 PM]: Existing Exonerate alignments found: predict/predict_misc/exonerate.out
[08:32 PM]: Existing GeneMark annotation found: predict/predict_misc/genemark.gff
[08:32 PM]: Found 30,705 gene models
[08:32 PM]: Running BUSCO to find conserved gene models for training Augustus
[09:03 PM]: 1,401 valid BUSCO predictions found, now formatting for EVM
[09:04 PM]: Setting up EVM partitions
[09:06 PM]: Generating EVM command list
[09:06 PM]: Running EVM commands with 27 CPUs
[09:07 PM]: Combining partitioned EVM outputs
[09:07 PM]: Converting EVM output to GFF3
[09:07 PM]: Collecting all EVM results
[09:07 PM]: Evidence modeler has failed, exiting
-----------------
 

drwxr-xr-x 4 zctea zctea 4.0K May 26 17:01 busco/
-rw-r--r-- 1 zctea zctea    0 May 26 17:33 busco_augustus.gff3
-rw-r--r-- 1 zctea zctea    0 May 26 17:32 busco_augustus.tmp
-rw-r--r-- 1 zctea zctea    0 May 26 17:33 busco_genemark.gff3
-rw-r--r-- 1 zctea zctea    0 May 26 17:33 busco_predictions.gff3
-rw-r--r-- 1 zctea zctea    0 May 26 17:33 busco_proteins.gff3
-rw-r--r-- 1 zctea zctea  48K May 26 17:32 buscos.bed
-rw-r--r-- 1 zctea zctea    0 May 26 17:33 busco_transcripts.gff3
-rw-r--r-- 1 zctea zctea   99 May 26 17:33 busco_weights.txt
-rw-r--r-- 1 zctea zctea  30M May 26 00:22 exonerate.out
drwxr-xr-x 6 zctea zctea 4.0K May 26 17:01 genemark/
-rw-r--r-- 1 zctea zctea    0 May 26 17:01 genemark.evm.gff3
-rw-r--r-- 1 zctea zctea    0 May 26 17:01 genemark.evm.gff3.bak
-rw-r--r-- 1 zctea zctea  36M May 26 17:01 genemark.gff
-rw-r--r-- 1 zctea zctea    0 May 26 17:01 genemark.temp.gff
-rw-r--r-- 1 zctea zctea 116M May 26 16:44 genome.fasta
-rw-r--r-- 1 zctea zctea 116M May 25 22:05 genome.softmasked.fa
-rw-r--r-- 1 zctea zctea 898K May 26 16:59 gmhmm.mod
-rw-r--r-- 1 zctea zctea  14M May 26 16:45 hints.ALL.gff
-rw-r--r-- 1 zctea zctea  24M May 26 16:45 hints.all.sort.tmp
-rw-r--r-- 1 zctea zctea  24M May 26 16:45 hints.all.tmp
-rw-r--r-- 1 zctea zctea  17M May 25 22:05 hints.M.gff
-rw-r--r-- 1 zctea zctea 6.8M May 26 16:45 hints.P.gff
-rw-r--r-- 1 zctea zctea  62M May 26 00:22 p2g.diamond.out
drwxrwxr-x 5 zctea zctea 4.0K May 26 15:39 predict/
-rw-r--r-- 1 zctea zctea 6.3M May 26 16:45 protein_alignments.gff3
-rw-r--r-- 1 zctea zctea 6.3M May 26 16:44 protein_alignments.gff3.old
-rw-r--r-- 1 zctea zctea 198M May 26 16:44 proteins.combined.fa
-rw-r--r-- 1 zctea zctea 198M May 26 16:44 proteins.combined.fa.old
drwxr-xr-x 2 zctea zctea 4.0K May 25 22:04 RepeatMasker/
-rw-r--r-- 1 zctea zctea    0 May 25 22:05 repeatmasker.gff3
drwxr-xr-x 3 zctea zctea  36K May 25 21:55 RepeatModeler/
-rw-r--r-- 1 zctea zctea 553K May 25 21:55 repeatmodeler.lib.fa
-rw-r--r-- 1 zctea zctea   29 May 26 16:44 scaffold.sort.order.txt
-rw-r--r-- 1 zctea zctea   40 May 26 16:44 scaffold.sort.rename.txt
-rw-r--r-- 1 zctea zctea  12M May 25 22:05 transcript_alignments.gff3
-rw-r--r-- 1 zctea zctea  12M May 26 16:44 transcript_alignments.gff3.old
-rw-r--r-- 1 zctea zctea  12M May 25 22:05 transcript_minimap2.gff3
-rw-r--r-- 1 zctea zctea  89M May 25 22:05 transcripts.combined.fa
-rw-r--r-- 1 zctea zctea  24M May 26 16:44 transcripts.minimap2.bam
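
Several files in this listing are zero bytes (for instance genemark.evm.gff3, even though genemark.gff is 36M), which may be a useful starting point when looking for the step that went wrong. A minimal sketch for pulling such files out of a directory follows. The function name `empty_files` is hypothetical, and the directory path is taken from the log above.

```python
import os


def empty_files(directory):
    """Return the sorted names of regular files in directory that are zero bytes."""
    names = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Skip subdirectories; only regular files are of interest here.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    for name in empty_files("predict/predict_misc"):
        print(name)
```

An empty file does not by itself prove a failure, but an empty converted file next to a large input suggests the conversion step is worth inspecting.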

I am a beginner in bioinformatics. I would appreciate some guidance on how to adjust the options so that the pipeline suits non-model plant genome annotation.

Taking Camellia sinensis as an example:

funannotate predict \
-i genome.fasta \
-o predict \
--species "Camellia sinensis"  \
--organism other  \
--busco_db embryophyta \
--busco_seed_species ?*** (non-model plant) \
--augustus_species ?*** (non-model plant) \
--optimize_augustus ?*** \
--transcript_evidence trinity.fasta \
--rna_bam alignments.bam \
--protein_evidence uniprot.fa \
--cpus 28

nextgenusfs commented 6 years ago

Did you check the EVM log file? Perhaps a Perl dependency is missing. Generally your command looks good; a few suggestions:

You can check which species are available (pre-trained) using the funannotate species command.

$ funannotate species
--------------------------
AUGUSTUS species options:
--------------------------
Conidiobolus_coronatus                      cryptococcus                                pfalciparum                                 
E_coli_K12                                  cryptococcus_neoformans_gattii              phanerochaete_chrysosporium                 
Xipophorus_maculatus                        cryptococcus_neoformans_neoformans_B        pichia_stipitis                             
adorsata                                    cryptococcus_neoformans_neoformans_JEC21    pneumocystis                                
aedes                                       culex                                       rhizopus_oryzae                             
amphimedon                                  debaryomyces_hansenii                       rhodnius                                    
ancylostoma_ceylanicum                      elegans                                     rice                                        
anidulans                                   elephant_shark                              rubeus_macgubis                             
arabidopsis                                 encephalitozoon_cuniculi_GB                 s_aureus                                    
aspergillus_fumigatus                       eremothecium_gossypii                       s_pneumoniae                                
aspergillus_nidulans                        fly                                         saccharomyces                               
aspergillus_oryzae                          fusarium                                    saccharomyces_cerevisiae_S288C              
aspergillus_terreus                         fusarium_graminearum                        saccharomyces_cerevisiae_rm11-1a_1          
b_pseudomallei                              galdieria                                   schistosoma                                 
bombus_impatiens1                           generic                                     schistosoma2                                
bombus_terrestris2                          heliconius_melpomene1                       schizosaccharomyces_pombe                   
botrytis_cinerea                            histoplasma                                 seahare                                     
brugia                                      histoplasma_capsulatum                      sulfolobus_solfataricus                     
c_elegans_trsk                              honeybee1                                   template_prokaryotic                        
cacao                                       human                                       tetrahymena                                 
caenorhabditis                              kluyveromyces_lactis                        thermoanaerobacter_tengcongensis            
camponotus_floridanus                       laccaria_bicolor                            tomato                                      
candida_albicans                            lamprey                                     toxoplasma                                  
candida_guilliermondii                      leishmania_tarentolae                       tribolium2012                               
candida_tropicalis                          lodderomyces_elongisporus                   trichinella                                 
chaetomium_globosum                         magnaporthe_grisea                          ustilago                                    
chicken                                     maize                                       ustilago_maydis                             
chlamy2011                                  maize5                                      verticillium_albo_atrum1                    
chlamydomonas                               nasonia                                     verticillium_longisporum1                   
chlorella                                   neurospora                                  wheat                                       
coccidioides_immitis                        neurospora_crassa                           yarrowia_lipolytica                         
coprinus                                    parasteatoda                                zebrafish                                   
coprinus_cinereus                           pchrysosporium                                                                          
coyote_tobacco                              pea_aphid          
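
To check programmatically whether a species name appears in a listing like the one above, the column layout can be split on whitespace. This is a sketch only. The function name `parse_species_listing` is hypothetical, and it assumes the output keeps the format shown here (dashed rules, a heading ending in a colon, then whitespace-separated names).

```python
def parse_species_listing(text):
    """Collect species names from a column-formatted listing."""
    names = set()
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if set(stripped) == {"-"}:
            continue  # separator rule made only of dashes
        if stripped.endswith(":") or stripped.startswith("$"):
            continue  # heading line or echoed shell prompt
        names.update(stripped.split())
    return names


if __name__ == "__main__":
    import sys

    # Example: funannotate species | python this_script.py arabidopsis
    available = parse_species_listing(sys.stdin.read())
    wanted = sys.argv[1]
    print(wanted, "is available" if wanted in available else "is not listed")
```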

So you can use --busco_seed_species arabidopsis, or perhaps a more closely related plant genome (if one exists). Leave the --augustus_species option off the command; this makes funannotate train Augustus automatically for your genome: it runs BUSCO with the --busco_seed_species parameters and then uses those gene models for training. You may also want to raise the --min_intronlen parameter (the default is for fungi).

If you have RNA-seq data, you can also use the funannotate train command, which will run genome-guided Trinity and PASA for you.
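
Putting these suggestions together, the command could be assembled as an argument list and run from a script. This is a sketch under stated assumptions: the helper name `build_predict_command` is hypothetical, only flags that appear in this thread are used, and any intron length passed in is a placeholder to be chosen for the organism, not a recommended value.

```python
import shlex


def build_predict_command(genome, outdir, species, busco_db, seed_species,
                          cpus, min_intronlen=None, extra=None):
    """Build a funannotate predict argument list without --augustus_species."""
    cmd = [
        "funannotate", "predict",
        "-i", genome,
        "-o", outdir,
        "--species", species,
        "--organism", "other",
        "--busco_db", busco_db,
        "--busco_seed_species", seed_species,
        "--cpus", str(cpus),
    ]
    if min_intronlen is not None:
        # Only add the flag when the caller overrides the default.
        cmd += ["--min_intronlen", str(min_intronlen)]
    # Evidence options such as --transcript_evidence go in `extra`.
    cmd += list(extra or [])
    return cmd


if __name__ == "__main__":
    command = build_predict_command(
        "genome.fasta", "predict", "Camellia sinensis", "embryophyta",
        "arabidopsis", 28,
        extra=["--transcript_evidence", "trinity.fasta"],
    )
    # Print a copy-pastable shell line; quoting protects the species name.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the arguments as a list avoids quoting problems with values that contain spaces, such as the species name.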