nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

0 gene models remaining (not possible?) #811

Open leegene1992 opened 1 year ago

leegene1992 commented 1 year ago

Describe the bug

I am running funannotate predict on a genome, and it reports 0 gene models remaining, which seems impossible given that I set the --repeat_filter parameter to none. Could you please tell me why all the gene models are being filtered out? Many thanks. The logfile follows.

Logfiles

[Sep 28 06:07 PM]: OS: openSUSE Leap 15.3, 48 cores, ~ 528 GB RAM. Python: 3.7.10
[Sep 28 06:07 PM]: Running funannotate v1.8.8
[Sep 28 06:07 PM]: Found training files, will re-use these files:
  --rna_bam train/training/funannotate_train.coordSorted.bam
  --pasa_gff train/training/funannotate_train.pasa.gff3
  --transcript_alignments train/training/funannotate_train.transcripts.gff3
[Sep 28 06:07 PM]: Skipping CodingQuarry as --organism=other. Pass a weight larger than 0 to run CQ, ie --weights codingquarry:1
[Sep 28 06:07 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     pasa
  genemark     selftraining
  glimmerhmm   pasa
  snap         pasa
[Sep 28 06:10 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Sep 28 06:11 PM]: Genome loaded: 124 scaffolds; 859,440,381 bp; 0.01% repeats masked
[Sep 28 06:11 PM]: Parsed 92,543 transcript alignments from: train/training/funannotate_train.transcripts.gff3
[Sep 28 06:11 PM]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[Sep 28 06:11 PM]: Extracting hints from RNA-seq BAM file using bam2hints
[Sep 28 06:11 PM]: Loading protein alignments all.exonerate.gff3
[Sep 28 06:12 PM]: Running GeneMark-ES on assembly
[Oct 15 11:21 PM]: 83,954 predictions from GeneMark
[Oct 15 11:21 PM]: Filtering PASA data for suitable training set
[Oct 15 11:22 PM]: 4,728 of 24,458 models pass training parameters
[Oct 15 11:22 PM]: Training Augustus using PASA gene models
[Oct 15 11:26 PM]: Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   93.6%         76.2%
  exons         82.4%         78.1%
  genes         22.8%         18.1%
[Oct 16 06:17 AM]: Augustus optimized training results:
  Feature       Specificity   Sensitivity
  nucleotides   95.5%         77.7%
  exons         85.5%         79.7%
  genes         31.3%         24.8%
[Oct 16 06:17 AM]: Running Augustus gene prediction using jenynsia.lineata parameters
[Oct 16 08:51 AM]: 47,165 predictions from Augustus
[Oct 16 08:51 AM]: Pulling out high quality Augustus predictions
[Oct 16 08:51 AM]: Found 15,193 high quality predictions from Augustus (>90% exon evidence)
[Oct 16 08:51 AM]: Running SNAP gene prediction, using training data: train/predict_misc/final_training_models.gff3
[Oct 16 09:37 AM]: 102 predictions from SNAP
[Oct 16 09:37 AM]: Running GlimmerHMM gene prediction, using training data: train/predict_misc/final_training_models.gff3
[Oct 16 10:48 AM]: 92,708 predictions from GlimmerHMM
[Oct 16 10:49 AM]: Summary of gene models passed to EVM (weights):
[Oct 16 10:49 AM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
[Oct 17 06:53 AM]: Converting to GFF3 and collecting all EVM results
  Source         Weight   Count
  Augustus       1        31972
  Augustus HiQ   2        15193
  GeneMark       1        83954
  GlimmerHMM     1        92708
  pasa           6        24458
  snap           1        102
  Total          -        248387
[Oct 17 06:53 AM]: 35,769 total gene models from EVM
[Oct 17 06:53 AM]: Generating protein fasta files from 35,769 EVM models
[Oct 17 06:53 AM]: now filtering out bad gene models (< 0 aa in length, transposable elements, etc).
[Oct 17 06:54 AM]: 0 gene models remaining
[Oct 17 06:54 AM]: Predicting tRNAs
Error: The requested file (train/predict_misc/evm.cleaned.gff3) could not be opened. Error message: (No such file or directory).
Exiting!
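
For anyone triaging a similar log, the per-predictor counts can be pulled out and compared quickly. The following is a minimal Python sketch, not part of funannotate: the function names `prediction_counts` and `suspicious` and the 1% threshold are assumptions made for illustration, and the regular expression only assumes the "N predictions from X" wording seen in the log above.

```python
import re

# Matches log entries such as "[Oct 16 09:37 AM]: 102 predictions from SNAP".
# Lines like "Found 15,193 high quality predictions from Augustus" do not match,
# because the number is not directly followed by " predictions from".
PRED_RE = re.compile(r"([\d,]+) predictions from (\w+)")


def prediction_counts(log_text):
    """Return {predictor: count} for every 'N predictions from X' entry."""
    return {name: int(num.replace(",", "")) for num, name in PRED_RE.findall(log_text)}


def suspicious(counts, fraction=0.01):
    """Predictors whose count is below `fraction` of the largest count.

    The 1% default is an arbitrary heuristic chosen for this sketch.
    """
    top = max(counts.values(), default=0)
    return sorted(name for name, n in counts.items() if n < top * fraction)


# Lines copied from the log in this issue.
log = """\
[Oct 15 11:21 PM]: 83,954 predictions from GeneMark
[Oct 16 08:51 AM]: 47,165 predictions from Augustus
[Oct 16 09:37 AM]: 102 predictions from SNAP
[Oct 16 10:48 AM]: 92,708 predictions from GlimmerHMM
"""

counts = prediction_counts(log)
print(counts)
print(suspicious(counts))
```

On these four lines the heuristic singles out SNAP, whose 102 predictions are far below the tens of thousands reported by the other predictors.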

nextgenusfs commented 1 year ago

A few things: SNAP seems to be failing, likely because the conda version is corrupt.

[Oct 16 08:51 AM]: Running SNAP gene prediction, using training data: train/predict_misc/final_training_models.gff3
[Oct 16 09:37 AM]: 102 predictions from SNAP

It seems to be failing only at the filtering step. What is the behavior if you run with --repeat_filter overlap?

leegene1992 commented 1 year ago

Yes, SNAP did indeed fail. I fixed it following https://github.com/nextgenusfs/funannotate/issues/386, but I get the same error when running with --repeat_filter overlap. I am actually running funannotate on two genomes: one is fragmented (longest contig 350 kb) and runs perfectly; the other (longest contig 42 Mb) produces the errors above. Does genome size have an effect? I would really appreciate it if this problem could be solved.

leegene1992 commented 1 year ago

I think I found where the problem is. I tested --repeat_filter overlap blast and it ran successfully, although some genes were filtered out. When I set --repeat_filter none, no genes are filtered, and that is what triggers the problem. So my guess is that the bug lies in the counting and filtering steps. I hope it can be fixed soon, because I would rather not filter any genes at the beginning and instead filter them afterwards. Many thanks.
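
The "filter afterwards" workflow described above could look roughly like the following. This is a minimal Python sketch under stated assumptions, not funannotate's own filtering code: the function name `filter_gff3`, the gene IDs, and the example records are hypothetical, and it assumes parent features appear before their children in the GFF3 file.

```python
def filter_gff3(lines, drop_ids):
    """Yield GFF3 lines, skipping dropped genes and all of their descendants.

    Assumes each parent feature is listed before its children, so that a
    child's Parent attribute can be checked against IDs already dropped.
    """
    dropped = set(drop_ids)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Pass comments, blank lines, and malformed records through untouched.
        if line.startswith("#") or len(cols) < 9:
            yield line
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("ID") in dropped or any(p in dropped for p in parents):
            # Remember this feature's ID so its children (exons, CDS) go too.
            if "ID" in attrs:
                dropped.add(attrs["ID"])
            continue
        yield line


# Hypothetical records for illustration only.
example = [
    "##gff-version 3\n",
    "scaffold_1\tEVM\tgene\t1\t900\t.\t+\t.\tID=gene1;\n",
    "scaffold_1\tEVM\tmRNA\t1\t900\t.\t+\t.\tID=gene1-T1;Parent=gene1;\n",
    "scaffold_1\tEVM\tgene\t2000\t2600\t.\t-\t.\tID=gene2;\n",
    "scaffold_1\tEVM\tmRNA\t2000\t2600\t.\t-\t.\tID=gene2-T1;Parent=gene2;\n",
    "scaffold_1\tEVM\texon\t2000\t2600\t.\t-\t.\tID=gene2-T1.exon1;Parent=gene2-T1;\n",
]

kept = list(filter_gff3(example, {"gene2"}))
print("".join(kept), end="")
```

The list of IDs to drop would come from whatever downstream check is applied (for example, a separate repeat or transposon screen), which keeps the full unfiltered gene set available until that decision is made.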