nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 85 forks source link

Lower than expected BUSCO score for funannotate predict-derived gene models compared to genome and transcriptome BUSCO scores #928

Open kushalsuryamohan opened 1 year ago

kushalsuryamohan commented 1 year ago

Hello, I am trying to annotate a reptilian genome that has a fairly good scaffolded assembly (N50 = ~215MB; 1,966 scaffolds). Genome busco completeness is ~93%

I have RNA-seq data from a pooled set of tissues. Trinity-derived de novo transcriptome models have a slightly lower BUSCO score of ~61% (this suggests a lack of diversity in the libraries and/or poor RNA-seq data quality). Despite this, I assumed that a combination of Augustus + Pasa could provide a good annotated gene set. However, the final set of gene predictions have a ~30% BUSCO completeness score.

I'm a little confused as to how this might be happening and would appreciate some guidance on improving this.

Here are the step-by-step commands I used to get to this point.

#Training with RNA-seq data as evidence
funannotate train -i genome_cleaned_renamed.fasta -o funannotate_train -l LIB4027606_3026444_S1_R1_001.fastq.gz LIB4027607_3026445_S2_R1_001.fastq.gz -r LIB4027606_3026444_S1_R2_001.fastq.gz LIB4027607_3026445_S2_R2_001.fastq.gz --cpus 16 --max_intronlen 20000 --no-progress --s "snake"

#Predict
funannotate predict -i genome_cleaned_renamed.fasta \
            -o funannotate_train --s "snake" --no-progress --optimize_augustus --repeats2evm --busco_db tetrapoda --cpus 20 --max_intronlen 20000 --organism other

Here is the log of training:

[Jun 17 10:18 AM]: OS: CentOS Linux 7, 104 cores, ~ 791 GB RAM. Python: 3.8.15 [Jun 17 10:18 AM]: Running 1.8.15 [Jun 17 10:18 AM]: Multiple inputs for --left and --right detected, concatenating PE reads [Jun 17 10:19 AM]: Adapter and Quality trimming PE reads with Trimmomatic [Jun 17 10:26 AM]: Running read normalization with Trinity [Jun 17 11:47 AM]: Building Hisat2 genome index [Jun 17 11:59 AM]: Aligning reads to genome using Hisat2 [Jun 17 12:02 PM]: Running genome-guided Trinity, logfile: funannotate_train/training/Trinity-gg.log [Jun 17 12:02 PM]: Clustering of reads from BAM and preparing assembly commands [Jun 17 01:36 PM]: Assembling 51,010 Trinity clusters using 15 CPUs [Jun 17 03:27 PM]: 85,468 transcripts derived from Trinity [Jun 17 03:27 PM]: Running StringTie on Hisat2 coordsorted BAM [Jun 17 03:28 PM]: Removing poly-A sequences from trinity transcripts using seqclean [Jun 17 03:30 PM]: Converting transcript alignments to GFF3 format [Jun 17 03:31 PM]: Converting Trinity transcript alignments to GFF3 format [Jun 17 03:31 PM]: Running PASA alignment step using 85,435 transcripts [Jun 18 05:06 AM]: PASA assigned 84,212 transcripts to 66,640 loci (genes) [Jun 18 05:06 AM]: Getting PASA models for training with TransDecoder [Jun 18 05:22 AM]: PASA finished. PASAweb accessible via: localhost:port/cgi-bin/index.cgi?db=/home/funannotate_train/training/pasa/snake_pasa [Jun 18 05:22 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus [Jun 18 05:22 AM]: Building Kallisto index [Jun 18 05:28 AM]: Mapping reads using pseudoalignment in Kallisto [Jun 18 05:32 AM]: Parsing expression value results. Keeping best transcript at each locus. [Jun 18 05:44 AM]: Wrote 21,812 PASA gene models [Jun 18 05:44 AM]: PASA database name: Bothrops_asper [Jun 18 05:44 AM]: Trinity/PASA has completed, you are now ready to run funanotate predict, for example:

funannotate predict -i genome_cleaned_renamed.fasta \ -o funannotate_train -s "snake" --cpus 16

Here's the BUSCO log for the trinity models:

Results: C:60.8%[S:38.9%,D:21.9%],F:9.9%,M:29.3%

And for PASA models:

C:58.1%[S:36.8%,D:21.3%],F:28.0%,M:13.9%

Below is the log of predict:

[Jun 20 03:46 PM]: OS: CentOS Linux 7, 104 cores, ~ 791 GB RAM. Python: 3.8.15 [Jun 20 03:46 PM]: Running funannotate v1.8.15 [Jun 20 03:46 PM]: Found training files, will re-use these files: --rna_bam funannotate_train/training/funannotate_train.coordSorted.bam --pasa_gff funannotate_train/training/funannotate_train.pasa.gff3 --stringtie funannotate_train/training/funannotate_train.stringtie.gtf --transcript_alignments funannotate_train/training/funannotate_train.transcripts.gff3 [Jun 20 03:46 PM]: Skipping CodingQuarry as --organism=other. Pass a weight larger than 0 to run CQ, ie --weights codingquarry:1 [Jun 20 03:46 PM]: Parsed training data, run ab-initio gene predictors as follows: ESC[4mProgram Training-MethodESC[0m augustus pasa
genemark selftraining
glimmerhmm pasa
snap pasa
[Jun 20 03:54 PM]: Loading genome assembly and parsing soft-masked repetitive sequences [Jun 20 03:56 PM]: Genome loaded: 1,966 scaffolds; 1,648,048,789 bp; 50.07% repeats masked [Jun 20 03:56 PM]: Parsed 64,062 transcript alignments from: funannotate_train/training/funannotate_train.transcripts.gff3 [Jun 20 03:56 PM]: Creating transcript EVM alignments and Augustus transcripts hintsfile [Jun 20 03:56 PM]: Existing RNA-seq BAM hints found: funannotate_train/predict_misc/hints.BAM.gff [Jun 20 03:56 PM]: Existing protein alignments found: funannotate_train/predict_misc/protein_alignments.gff3 [Jun 20 03:57 PM]: Existing GeneMark annotation found: funannotate_train/predict_misc/genemark.gff [Jun 20 03:58 PM]: 105,111 predictions from GeneMark [Jun 20 03:58 PM]: Filtering PASA data for suitable training set [Jun 20 03:59 PM]: 3,158 of 21,812 models pass training parameters [Jun 20 03:59 PM]: Existing Augustus annotations found: funannotate_train/predict_misc/augustus.gff3 [Jun 20 03:59 PM]: Pulling out high quality Augustus predictions [Jun 20 03:59 PM]: Found 7,523 high quality predictions from Augustus (>90% exon evidence) [Jun 20 03:59 PM]: Existing snap predictions found funannotate_train/predict_misc/snap-predictions.gff3 [Jun 20 03:59 PM]: 197,409 predictions from SNAP [Jun 20 03:59 PM]: Running GlimmerHMM gene prediction, using training data: funannotate_train/predict_misc/final_training_models.gff3 [Jun 20 07:19 PM]: 331,206 predictions from GlimmerHMM [Jun 20 07:20 PM]: Summary of gene models passed to EVM (weights): [Jun 20 07:20 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval [Jun 20 10:48 PM]: Converting to GFF3 and collecting all EVM results ESC[4mSource Weight Count ESC[0m Augustus 1 19456 Augustus HiQ 2 7523
GeneMark 1 105111 GlimmerHMM 1 331206 pasa 6 21812 snap 1 197409 Total - 682517 [Jun 20 10:48 PM]: 31,673 total gene models from EVM [Jun 20 10:48 PM]: Generating protein fasta files from 31,673 EVM models [Jun 20 10:48 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc). [Jun 20 10:49 PM]: Found 985 gene models to remove: 129 too short; 1 span gaps; 855 transposable elements [Jun 20 10:49 PM]: 30,688 gene models remaining [Jun 20 10:49 PM]: Predicting tRNAs [Jun 21 01:57 AM]: 5,604 tRNAscan models are valid (non-overlapping) [Jun 21 01:57 AM]: Generating GenBank tbl annotation file [Jun 21 02:00 AM]: Collecting final annotation files for 36,292 total gene models [Jun 21 02:00 AM]: Converting to final Genbank format [Jun 21 02:04 AM]: Funannotate predict is finished, output files are in the funannotate_train/predict_results folder [Jun 21 02:04 AM]: Your next step to capture UTRs and update annotation using PASA:

funannotate update -i funannotate_train --cpus 16

The final genome annotation BUSCO as I mentioned is ~33%:

Results C:33.4%[S:32.2%,D:1.2%],F:23.6%,M:43.0%

Any help/insights here will be much appreciated!

hyphaltip commented 1 year ago

I've found this to happen in some large genomes I was working with too - though i cannot totally link it to specific issues - I suspect EVM is to blame if there is a lot of variable conflicts in the predictions, first thing to try is to rerun but set the weights for glimmerhmm to 0 given it is 3x the number of calls - I suspect lots of genes are getting split perhaps.... but I don't totally know for sure but it would be fast to drop it and have it re-run. --weights codingquarry:1 glimmerhmm:0

what I've done in the past is load the GFF in jbrowse2 and compare the individual gene tracks with the final EVM set and where the TEs are.

Another is to compare the BUSCO genome calls to the BUSCO protein calls and see what are examples of missing BUSCO genes in the annotation but are clearly in the BUSCO genome-based gene calls.

Another thought is to just try BRAKER3 and see how it compares in the predictions.

nextgenusfs commented 1 year ago

@kushalsuryamohan I don't know the problem/issue off the top of my head. I'm interested in understanding how/why the training is so different with the different ab initio tools (ie why is GlimmerHMM and snap so bad and GeneMark is 1/2 of what it should be?). Would you be interesting/willing to share the data privately and I can try to look? I would just need the assembly and your PASA GFF3 (TransDecoder filtered would be fine). Mostly I'm working on funannotate2 which is effectively a code clean up, reduce dependencies, and simplify as much as possible (one route is to replace EVM with a consensus gene model tool). I don't have this completely working yet, but if I need to make design changes for training it would be good to know sooner rather than later. If you would like to share data, can send to my email nextgenusfs at gmail.

kushalsuryamohan commented 1 year ago

Hi Jon @nextgenusfs, Sure I'd be happy to share this (although it is unpublished data so I'd appreciate it being kept private). How do you propose I share the data?

Thanks!

nextgenusfs commented 1 year ago

Yes, will keep private -- just want to see if I can figure out why this is happening and if we can fix. Is it possible to email me compressed version, or possibly a shared link from somewhere?

kushalsuryamohan commented 1 year ago

Will do. Can you share your email address, please?

nextgenusfs commented 1 year ago

nextgenusfs at gmail

kushalsuryamohan commented 1 year ago

Hi @hyphaltip and @nextgenusfs , out of curiosity and to dig deeper into this, I extracted transcripts from the augustus gff3 (from the predict_misc directory) and ran BUSCO. Here are the BUSCO results (metazoa):


    |Results from dataset metazoa_odb10               |
    --------------------------------------------------
    |C:98.2%[S:89.4%,D:8.8%],F:1.0%,M:0.8%,n:954      |
    |937    Complete BUSCOs (C)                       |
    |853    Complete and single-copy BUSCOs (S)       |
    |84     Complete and duplicated BUSCOs (D)        |
    |10     Fragmented BUSCOs (F)                     |
    |7      Missing BUSCOs (M)                        |
    |954    Total BUSCO groups searched               |
    --------------------------------------------------

I, like you, suspect that this has to do with EVM.

To recap, here are the weights passed to EVM:

Augustus 1 19456 Augustus HiQ 2 7523
GeneMark 1 105111 GlimmerHMM 1 331206 pasa 6 21812 snap 1 197409

Can I pass --weights codingquarry:1 glimmerhmm:0 snap:0 and increase weights to Augustus to 2 and Augustus HiQ to something higher? Any suggestions?

@nextgenusfs I am uploading the data per your request. Will send you an email once the data has been uploaded for you to take a look at as well. I appreciate the help very much.

kushalsuryamohan commented 1 year ago

Hi @nextgenusfs, I've uploaded the genome and PASA GFF3 and sent you an email with access details. Can you also see my previous post above? It might help you troubleshoot this as well. Thanks!

melop commented 1 year ago

For vertebrate genomes, I set weights for all gene predictors to 0 except for augustus. I never got the other ones to perform (including the non-free genemark) well and adding them makes the final prediction much worse. Your results confirmed my experience indeed.

melop commented 1 year ago

For vertebrate genomes, I set weights for all gene predictors to 0 except for augustus. I never got the other ones to perform (including the non-free genemark) well and adding them makes the final prediction much worse. Your results confirmed my experience indeed.