nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 84 forks source link

How is evidence used exactly? #283

Closed photocyte closed 4 years ago

photocyte commented 5 years ago

Hi there,

funannotate seems to use a lot of evidence, but I wasn't able to find specifics on how the evidence is used to tune ab-initio prediction exactly in the docs. For example, what evidence is used for direct augustus hints, and what evidence is provided to evidence modeler? Are certain types of evidence used for both, or for only one?

All the best, -Tim

nextgenusfs commented 5 years ago

The short answer is that it uses evidence whenever possible. Transcript alignments via minimap2/blat/gmap and protein alignments via diamond/exonerate are used to train Augustus (via selecting gene models to use for training), during Augustus prediction (fed via --hintsfile), and fed to evidence modeler. The evidence is also used to selectively pull out high quality Augustus models (those where exons >90% of exons are supported by evidence). The aligned evidence is not used to predict gene models directly.

photocyte commented 5 years ago

Thanks for the feedback! Gave me the confidence to dive into the code.

What do you mean that the aligned evidence is not used to predict gene models directly? Seems like it is from the code?

Seems like "funannotate predict" also uses hints/extrinsic data: https://github.com/nextgenusfs/funannotate/blob/30c116643a31137f1cbfc747b02bf6bd0a1c158a/bin/funannotate-predict.py#L426-L431

I wasn't able to find a exact place where "funnannotate train" calls augustus with hints, but seems like augustus is typically called with a hints file: https://github.com/nextgenusfs/funannotate/blob/30c116643a31137f1cbfc747b02bf6bd0a1c158a/bin/augustus_parallel.py#L44-L45

Also, is there a way to feed hints for externally aligned transcripts / proteins evidence into funannotate predict from a GFF3 file, rather than having funannotate align it internally? I'd like to use funannotate predict for the hints incorporation w/ augustus, but can parallelize the transcript / protein alignments on our computing cluster independently so things complete faster.

I note there is a --maker_gff option, but there is a bit of an ominous note in the docs:

Note this option is provided out of convenience for the user, however, it won’t provide the best results.
nextgenusfs commented 5 years ago

What do you mean that the aligned evidence is not used to predict gene models directly? Seems like it is from the code?

EvidenceModeler does not use protein/transcript alignments to create gene models de novo -- the gene models must be present in the gene predictions GFF3 file. EVM scores the gene predictions (gene models) based on the alignment evidence. Funannotate does not do any extra routines/methods to predict gene models directly from the transcript/protein alignments -- it uses the data to provide hints to Augustus -- and then passes those alignments to EVM to use to score/weight the consensus predictions.

Also, is there a way to feed hints for externally aligned transcripts / proteins evidence into funannotate predict from a GFF3 file, rather than having funannotate align it internally?

Currently there is not a way to do this. I can't imagine a scenario where minimap2 transcript alignments would be slowing you down -- it is incredibly fast -- note minimap2 is the default. Protein alignments on the other hand are slow -- because we are trying to get intron/exon boundaries we use exonerate which is not fast. I've tried to speed it up with multiprocessing and slitting the calls to exonerate, but it is not fast. I would not get carried away with protein models to use as evidence -- the default is to use UniProt/SwissProt which in most cases is sufficient. If you have highly curated/validated protein models from a closely related species you can use those as well. But in general the transcript alignments are more robust in terms of generating hints.

I wasn't able to find a exact place where "funnannotate train" calls augustus with hints, but seems like augustus is typically called with a hints file:

funannotate train is a wrapper for running genome guided TRINITY followed by PASA from RNA-seq data (either Illumina and/or pacbio/nanopore). These data are then used seamlessly in funannotate predict if they exist. The training of Augustus is done in the funannotate predict workflow because it also supports BUSCO mediated training (i.e you don't have any RNA-seq evidence). Augustus is incredibly sensitive to gene models used for training, so it uses some scripts to filter gene models based on the hintsfile.

The maker option is just to convert maker GFF3 to funannotate format -- it will not rerun any predictions but will filtering the gene models through EVM and subsequently make the consensus gene models pass NCBI submission requirements.

photocyte commented 5 years ago

Thanks! Very useful information. Yes, you have some notes in the docs about being careful about the protein evidence. My use case is calling transdecoder on the Trinity.fasta file (de novo transcriptome from the same species as the genome), and then passing those resulting proteins as the evidence to funannotate. As for when alignments might be slowing things down, I'm planning to use funannotate for a very large (human sized) insect genome, so the alignment parallelization might be pretty important.

nextgenusfs commented 5 years ago

There would be no real benefit gained from taking transdecoder protein files and aligning to the genome if you have transcripts -- pass the transcripts as they will map more precisely and faster. The transcript mapping can handle down to 80% identity, i.e. so EST from closely related species could work as well.

If you have already run trinity you can pass that result to --trinity option in funannotate train. This will then run PASA/transdecoder on those transcripts and simultaneously generate all of the files you need for funannotate predict.

photocyte commented 5 years ago

I don't have transcripts / peptides for every gene though. I'm hoping that the peptide alignments will work over a more distant evolutionary distance (e.g., a 60% identity peptide can be picked up by the protein alignment, I'd expect, but the transcript between the same two genes may not align over that distance). Currently planning swiss-prot + the transdecoder peptides.

P.S. I see there is a --protein_alignments and --transcript_alignments option for funannotate predict (https://funannotate.readthedocs.io/en/latest/commands.html#funannotate-predict) . So it is possible to align outside of funannotate predict, provided I give it in the right format?

nextgenusfs commented 5 years ago

Yes -- the format for --protein_alignments is adhering to EVM and is non-standard GFF3: https://evidencemodeler.github.io/#Preparing_inputs

chr1    exonerate       nucleotide_to_protein_match     1725048 1725326 92.47   -       .       ID=match.429643.2;Target=RL44B_YEAST 2 95
chr6    exonerate       nucleotide_to_protein_match     1564058 1564202 93.00   +       .       ID=match.429643.3;Target=H4_USTMA 2 50
chr6    exonerate       nucleotide_to_protein_match     1564270 1564427 93.00   +       .       ID=match.429643.3;Target=H4_USTMA 51 103
chr4    exonerate       nucleotide_to_protein_match     1008152 1008287 91.84   -       .       ID=match.429643.4;Target=H4_TRICD 5 50
chr4    exonerate       nucleotide_to_protein_match     1007921 1008081 91.84   -       .       ID=match.429643.4;Target=H4_TRICD 51 104

And per EVM -- only alignments >80% are retained for protein evidence.

Transcript GFF3 format is also non-standard per EVM format:

##gff-version 3
chr2    genome  cDNA_match      1       179     100.00  +       .       ID=minimap2_1;Target=jgi|Trire2|69928|e_gw1.31.6.1 1 179 +
chr2    genome  cDNA_match      258     1447    100.00  +       .       ID=minimap2_1;Target=jgi|Trire2|69928|e_gw1.31.6.1 180 1369 +
chr2    genome  cDNA_match      1536    2789    100.00  +       .       ID=minimap2_1;Target=jgi|Trire2|69928|e_gw1.31.6.1 1370 2829 +
chr2    genome  cDNA_match      2183    2588    100.00  +       .       ID=minimap2_2;Target=Trinity_GG_15869_c0_g1_i1 1 406 +
chr2    genome  cDNA_match      2183    2588    100.00  -       .       ID=minimap2_3;Target=Trinity_GG_6142_c1_g1_i1 1 406 -

These are all the reasons you probably should just try funannotate first -- if it is too slow or doesn't work then think about modifying....

The goal here is NOT to map all proteins that can be mapped -- it is to map protein/transcripts with high accuracy so that intron/exon boundaries can be properly identified. This is what trains the HMM models in Augustus. You do not need to find them all -- you need to have enough evidence to have Augustus generate reasonable HMM model for your species. This is somewhat arbitrary and empirical....

photocyte commented 5 years ago

Still waiting for RepeatModeler to finish, but will definitely try vanilla funannotate first :)

I am trying to avoid the internal training however, as I already have Augustus parameters that are being trained by BUSCO (taking a very long time - a few days into the process), plus I don't quite understand how or when funanotate is training the various ab-initio predictors that are used, given that funannotate train and funannotate predict both seem to train or not train based on certain parameters. I don't know if I have good reason to believe that rerunning the training within funannotate with the RNA-Seq data I have will improve the training from BUSCO alone.

I consider the transdecoder derived gene models (transcripts and peptides lifted over from transcript coordinates to genome coordinates using the transdecoder cdna_alignment_orf_to_genome_orf.pl script) to be a sort of ground truth for gene models.

So, I'm more interested in the hints being provided to Augustus to improve the ab-initio predictions. IMO the more closely related proteins (such as those from the same species), will be more informative to the previously-trained Augustus. So not sure if I follow the rationale to not map all proteins. The below is my understanding of how to do that:

funannotate predict -i ${softmasked_genome} -s ${genus_species} --transcript_evidence ${denovo_transcripts} \
    -o output --protein_evidence ${denovo_transcript_peptides} ${swissprot} \
    --name PNM \
    --augustus_species "XXXX" \ ##This makes it so it doesn't train Augustus internally. Set once completed
    --cpus ${task.cpus} \
    --organism other \
    --max_intronlen 200000 \
    --min_protlen 20 \
    --keep_no_stops \
    --ploidy 2 \

Let me know if I'm on the wrong track here. Definitely there are a lot of empirical choices for genome annotation, and oftentimes the rationale isn't very clear. Thanks for taking the time to discuss, and for writing funannotate!

nextgenusfs commented 5 years ago

The nomenclature isn't perhaps the most clear -- as Augustus is really the only thing that needs to be trained (GeneMark runs in self training mode) -- but Augustus is also the most difficult to train and where you will see the largest impact on predictions. funannotate train is somewhat of a misnomer in that it doesn't train Augustus directly, but rather generates all the precursor files needed to do RNA-seq mediated training when you run funannotate predict. One of the benefits of funannoate, is that it tries really hard to only pass the best gene models for Augustus training -- this is probably the most important step of the process and the step that distinguishes it from other pipelines -- and will have the largest impact on your predictions. There are a bunch of rules nested in this step, i.e. single exon gene models (even if valid and matched to evidence) are not used for training as it heavily biases the training step, no two gene models can be within 80% identical to each other at protein level, and all models should be supported by evidence if it exists. So decision tree for funannotate predict to run Augustus is: 1) if species exists via --augustus_species it will use existing parameters, 2a) if species doesn't exist it will first look for PASA predictions (based on evidence and validated via transdecoder), if PASA does not exist it will run BUSCO2, 2b) it will filter gene models from PASA or BUSCO that fit parameters above, 2c) run the Augustus training scripts.

My point about the mapping of protein evidence is that EVM will only use gene models derived from Augustus, GeneMark, PASA, or if you passed something to --other_gff. It uses a scoring system partially based on the evidences (transcripts/proteins), but if will not de novo predict gene models. So the real challenge is getting the ab initio predictors to properly define gene models - EVM simply chooses the best one from the pool at each locus.

RepeatModeler/Masker is incredibly slow, especially for large genomes.

Your command looks like a good place to start.