nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
301 stars 82 forks

other gff file could not be passed to EVM #782

Closed xizhesun closed 1 year ago

xizhesun commented 1 year ago

Are you using the latest release? funannotate v1.8.13

Describe the bug I set "--other_gff compreh_init_build.gff3:10". funannotate recognised the file and set its source to other_pred1, but other_pred1 did not appear in the summary of gene models passed to EVM.

What command did you issue? funannotate predict --name FOL007 -i Fol007.final.repeatmasker.fasta -o Fol007_anotation_Trinity-GG_compreh --pasa_gff mydb.sqlite.assemblies.fasta.transdecoder.genome.gff3 --transcript_evidence Trinity-GG.fasta --rna_bam fo.sort.merged.bam --cpus 128 -s "fusarium oxysporum" --strain Fol007 --other_gff compreh_init_build.gff3:10 --busco_db sordariomycetes

Logfiles Please provide relevant log files of the error.

[Sep 13 02:31 PM]: OS: Ubuntu 22.04, 384 cores, ~ 2113 GB RAM. Python: 3.8.13
[Sep 13 02:31 PM]: Running funannotate v1.8.13
[Sep 13 02:31 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program        Training-Method
  augustus       pasa
  codingquarry   rna-bam
  genemark       selftraining
  glimmerhmm     pasa
  snap           pasa
[Sep 13 02:31 PM]: Parsing GFF pass-through: compreh_init_build.gff3 --> setting source to other_pred1
[Sep 13 02:54 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Sep 13 02:54 PM]: Genome loaded: 15 scaffolds; 56,041,038 bp; 0.68% repeats masked
[Sep 13 02:54 PM]: Aligning transcript evidence to genome with minimap2
[Sep 13 02:54 PM]: Found 44,806 alignments, wrote GFF3 and Augustus hints to file
[Sep 13 02:54 PM]: Extracting hints from RNA-seq BAM file using bam2hints
[Sep 13 08:44 PM]: Mapping 555,213 proteins to genome using diamond and exonerate
[Sep 13 08:46 PM]: Found 298,521 preliminary alignments with diamond in 0:01:16 --> generated FASTA files for exonerate in 0:00:23
[Sep 13 08:51 PM]: Exonerate finished in 0:05:14: found 2,100 alignments
[Sep 13 08:51 PM]: Running GeneMark-ES on assembly
[Sep 13 09:08 PM]: 19,044 predictions from GeneMark
[Sep 13 09:08 PM]: Filtering PASA data for suitable training set
[Sep 13 09:12 PM]: 4,019 of 57,756 models pass training parameters
[Sep 13 09:12 PM]: Training Augustus using PASA gene models
[Sep 13 09:13 PM]: Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   95.0%         92.5%
  exons         79.0%         76.5%
  genes         55.2%         53.2%
[Sep 13 09:13 PM]: Running Augustus gene prediction using fusarium_oxysporum_fol007 parameters
[Sep 13 09:13 PM]: 15,302 predictions from Augustus
[Sep 13 09:13 PM]: Pulling out high quality Augustus predictions
[Sep 13 09:13 PM]: Found 7,688 high quality predictions from Augustus (>90% exon evidence)
[Sep 13 09:13 PM]: Running stringie on RNA-seq alignments
[Sep 13 10:16 PM]: Running CodingQuarry prediction using stringtie alignments
[Sep 13 10:38 PM]: 21,427 predictions from CodingQuarry
[Sep 13 10:38 PM]: Running SNAP gene prediction, using training data: Fol007_anotation_Trinity-GG_compreh/predict_misc/final_training_models.gff3
[Sep 13 10:42 PM]: 18,957 predictions from SNAP
[Sep 13 10:42 PM]: Running GlimmerHMM gene prediction, using training data: Fol007_anotation_Trinity-GG_compreh/predict_misc/final_training_models.gff3
[Sep 13 10:53 PM]: 18,115 predictions from GlimmerHMM
[Sep 13 10:53 PM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        7614
  Augustus HiQ   2        7688
  CodingQuarry   2        21427
  GeneMark       1        19044
  GlimmerHMM     1        18115
  pasa           6        57756
  snap           1        18957
  Total          -        150601
[Sep 13 10:53 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
[Sep 13 11:37 PM]: Converting to GFF3 and collecting all EVM results
[Sep 13 11:37 PM]: 22,839 total gene models from EVM
[Sep 13 11:37 PM]: Generating protein fasta files from 22,839 EVM models
[Sep 13 11:37 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Sep 13 11:37 PM]: Found 1,360 gene models to remove: 7 too short; 0 span gaps; 1,353 transposable elements
[Sep 13 11:37 PM]: 21,479 gene models remaining
[Sep 13 11:37 PM]: Predicting tRNAs
[Sep 13 11:39 PM]: 296 tRNAscan models are valid (non-overlapping)
[Sep 13 11:39 PM]: Generating GenBank tbl annotation file
[Sep 13 11:40 PM]: Collecting final annotation files for 21,775 total gene models
[Sep 13 11:40 PM]: Converting to final Genbank format
[Sep 13 11:41 PM]: Funannotate predict is finished, output files are in the Fol007_anotation_Trinity-GG_compreh/predict_results folder
[Sep 13 11:41 PM]: Your next step to capture UTRs and update annotation using PASA:

OS/Install Information Ubuntu 22.04 LTS. "funannotate test -t busco --cpus 4" completed successfully.

You are running Perl v b'5.032001'. Now checking perl modules...
  Carp: 1.38
  Clone: 0.42
  DBD::SQLite: 1.70
  DBI: 1.643
  DB_File: 1.855
  Data::Dumper: 2.183
  File::Basename: 2.85
  File::Which: 1.24
  Getopt::Long: 2.52
  Hash::Merge: 0.302
  JSON: 4.09
  LWP::UserAgent: 6.67
  Logger::Simple: 2.0
  POSIX: 1.94
  Parallel::ForkManager: 2.02
  Pod::Usage: 1.69
  Scalar::Util::Numeric: 0.40
  Storable: 3.15
  Text::Soundex: 3.05
  Thread::Queue: 3.14
  Tie::File: 1.06
  URI::Escape: 5.12
  YAML: 1.30
  local::lib: 2.000029
  threads: 2.25
  threads::shared: 1.61
ERROR: DBD::mysql not installed, install with cpanm DBD::mysql

Checking Environmental Variables...
  $FUNANNOTATE_DB=/home/data2/mals/funannotate
  $PASAHOME=/home/data2/mals/anaconda3/envs/funannotate/opt/pasa-2.5.2
  $TRINITY_HOME=/home/data2/mals/anaconda3/envs/funannotate/opt/trinity-2.8.5
  $EVM_HOME=/home/data2/mals/anaconda3/envs/funannotate/opt/evidencemodeler-1.1.1
  $AUGUSTUS_CONFIG_PATH=/home/data2/mals/anaconda3/envs/funannotate/config/
  $GENEMARK_PATH=/home/data2/mals/tools/GeneMark-ET/gmes_linux_64
All 6 environmental variables are set

Checking external dependencies...
  PASA: 2.5.2
  CodingQuarry: 2.0
  Trinity: 2.8.5
  augustus: 3.4.0
  bamtools: bamtools 2.5.1
  bedtools: bedtools v2.30.0
  blat: BLAT v35
  diamond: 2.0.15
  emapper.py: 2.1.9
  ete3: 3.1.2
  exonerate: exonerate 2.4.0
  fasta: no way to determine
  glimmerhmm: 3.0.4
  gmap: 2021-08-25
  gmes_petap.pl: 4.69_lic
  hisat2: 2.2.1
  hmmscan: HMMER 3.3.2 (Nov 2020)
  hmmsearch: HMMER 3.3.2 (Nov 2020)
  java: 11.0.13
  kallisto: 0.46.1
  mafft: v7.505 (2022/Apr/10)
  makeblastdb: makeblastdb 2.2.31+
  minimap2: 2.24-r1122
  pigz: pigz 2.6
  proteinortho: 6.1.0
  pslCDnaFilter: no way to determine
  salmon: salmon 0.14.1
  samtools: samtools 1.15.1
  signalp: 5.0b
  snap: 2006-07-28
  stringtie: 2.1.7
  tRNAscan-SE: 2.0.9 (July 2021)
  tantan: tantan 39
  tbl2asn: no way to determine, likely 25.X
  tblastn: tblastn 2.2.31+
  trimal: trimAl v1.4.rev15 build[2013-12-17]
  trimmomatic: 0.39
All 37 external dependencies are installed

nextgenusfs commented 1 year ago

Is this a Trinity comprehensive GFF3 file? Are you sure the contig/reference coordinates in the GFF3 are relative to your actual genome? Any contigs in the GFF3 that are not found in your genome will be ignored.

xizhesun commented 1 year ago

Yes, it's a Trinity comprehensive GFF3 file. I have checked it, and it refers to the same genome. With the default settings, some important gene models were missing from my funannotate results, so I am trying to use "other_gff" to recover them.

Thanks for the reply. I will try passing mydb.sqlite.pasa_assemblies.gff3 to "other_gff" and run it again.

nextgenusfs commented 1 year ago

The PASA assemblies GFF3 file also does not map to the genome reference; you need to use the one that has .genome. in the name. You can look at the first few entries and check whether column 1 of the file contains contig names from your genome assembly.
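The column 1 check can also be scripted rather than done by eye. Below is a minimal sketch in Python, assuming an uncompressed FASTA genome and an uncompressed GFF3 file. The function names `fasta_ids`, `gff3_seqids` and `missing_seqids` are hypothetical helpers written for this illustration and are not part of funannotate.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first word of each '>' header)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff3_seqids(lines):
    """Return the set of column-1 values found on GFF3 feature lines."""
    seqids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqids.add(line.split("\t", 1)[0])
    return seqids


def missing_seqids(genome_lines, gff_lines):
    """Column-1 IDs used in the GFF3 that are absent from the genome FASTA."""
    return sorted(gff3_seqids(gff_lines) - fasta_ids(genome_lines))


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python check_seqids.py genome.fasta models.gff3
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as genome, open(sys.argv[2]) as gff:
            missing = missing_seqids(genome, gff)
        if missing:
            print("GFF3 seqids not found in genome: " + ", ".join(missing))
        else:
            print("All GFF3 seqids are present in the genome")
```

An empty result only shows that the sequence names match. It says nothing about whether the features in the file are gene models.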

xizhesun commented 1 year ago

Yes, column 1 is consistent with the chromosome IDs.

[two attached images]

nextgenusfs commented 1 year ago

Okay. Well it fails because those aren't gene models. They are just alignments.
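One quick way to tell the two kinds of file apart is to tally the feature types in column 3 of the GFF3. The sketch below is a heuristic written for this illustration and is not funannotate's own validation. It assumes that a gene-model file contains `gene`, `mRNA` and `CDS` features, and that an alignment-only file contains other types such as `cDNA_match`. The names `feature_type_counts` and `looks_like_gene_models` are hypothetical.

```python
from collections import Counter

# Assumption for this sketch: a gene-model GFF3 has all three of these types.
GENE_MODEL_TYPES = ("gene", "mRNA", "CDS")


def feature_type_counts(lines):
    """Count how often each feature type (GFF3 column 3) occurs."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def looks_like_gene_models(lines):
    """True if every type in GENE_MODEL_TYPES appears at least once."""
    counts = feature_type_counts(lines)
    return all(counts[t] > 0 for t in GENE_MODEL_TYPES)


if __name__ == "__main__":
    import sys

    # Usage (file name is a placeholder): python gff3_types.py models.gff3
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as handle:
            for ftype, n in feature_type_counts(handle).most_common():
                print(f"{ftype}\t{n}")
```

A file whose tally shows only alignment-style feature types would not pass `looks_like_gene_models`.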

xizhesun commented 1 year ago

Thanks for the detailed explanation. So I need to use the GFF file produced by "pasa_asmbls_to_training_set.dbi". But after the "training_set" step, "mydb.sqlite.assemblies.fasta.transdecoder.cds" has lost some important genes, even though these genes are all present in the Trinity genome-guided assembly. Is there a way to convert the Trinity genome-guided assembly into gene models without training? I am sure these important genes are all in the Trinity assembly, but no ab initio tool annotates all of them. They are pathogenicity-related genes, and some of them may be horizontally transferred genes.

xizhesun commented 1 year ago

I checked it again. "Trinity-GG.fasta" contains all the important genes. After the PASA step, "mydb.sqlite.assemblies.fasta" also contains all of them. But some of the important PASA gene models are filtered out by TransDecoder! I really don't know how to deal with this. Is there another recommended method that stays closer to the original transcript evidence, without filtering?

xizhesun commented 1 year ago

Finally, I solved the problem by using TransDecoder without the filter. Thank you! I will close this issue.