nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Box flow-chart for funannotate pipeline? #70

Closed MichaelFokinNZ closed 4 years ago

MichaelFokinNZ commented 7 years ago

Hi, do you maybe have a kind of box chart for the funannotate pipeline? Something like this: image

MichaelFokinNZ commented 7 years ago

Is this a correct overall representation of the funannotate predict pipeline (with some extras)? The design is a draft and I will also somehow highlight the steps wrapped into funannotate.... Maybe it is better to divide the GeneMark and Augustus predictions...

picture1

hyphaltip commented 7 years ago

I especially like how the important steps to make the consensus gene models are missing from the JGI chart, or how the different evidence is weighted...

hyphaltip commented 7 years ago

Glad you're taking a stab at this.

Also, your chart should probably be more generic unless this is for your paper; otherwise people will think you need RNA or need 3 RNA-seq libraries...

Some of the data are optional.

It might be useful to show the default EVM weights assigned to each type of evidence being input.

I'm unclear what "Merge" means. @nextgenusfs has a transposase filtering step after the proteins are predicted to filter out Tpases - it would be good to clarify the parameters of that step and what stage of the data it is performed on.

The assignment of function / description to gene products is also important; the SwissProt description, for example, and any cleanup performed on it after the fact could be documented as well.

nextgenusfs commented 7 years ago

Generally the flow chart for funannotate predict is accurate. @hyphaltip points out a few things that are important if this were to be a "general" flow chart for funannotate.

To clarify a bit on the evidence usage: funannotate does use the mapped evidence (ESTs, transcripts, proteins) during Augustus (of course, only if that evidence is passed). The more evidence you can pass, the more the annotation should theoretically improve. The other thing it does is pull out high quality annotations from Augustus, which are defined as gene models where > 90% of the exons are supported by evidence. These are then passed to EVM and given 5X more weight than the rest of the gene models. The intent here is to tilt the EVM scoring matrix in favor of these high quality predictions. If PASA/Transdecoder gene models are passed they are given an EVM weight of 10, while de novo predictions and EST/protein mapping evidence are all weighted at 1.
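
For illustration, the "> 90% of exons supported" idea boils down to an interval-overlap check like the minimal sketch below; the function names and tuple-based intervals are hypothetical, not funannotate's actual code.

def exon_supported(exon, hints):
    # True if the exon (start, end) overlaps any evidence hint interval
    start, end = exon
    return any(min(end, h_end) >= max(start, h_start) for h_start, h_end in hints)

def is_high_quality(exons, hints, threshold=0.90):
    # flag a model as HiQ when more than `threshold` of its exons have evidence support;
    # HiQ models are then handed to EVM with a weight of 5 instead of 1
    supported = sum(exon_supported(e, hints) for e in exons)
    return supported / len(exons) > threshold

# example: all 3 exons overlap transcript/protein alignments -> HiQ
print(is_high_quality([(100, 250), (400, 600), (800, 950)],
                      [(90, 260), (390, 610), (780, 1000)]))  # True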

The transposase/repeat filtering is done in two steps: 1) a protein BLAST search against known transposases from RepBase plus the Transposon PSI database, and 2) flagging/removing gene models that are > 90% contained inside a masked repetitive element. This is quite different from what Maker does. I use this method because highly repetitive genomes are over-masked by the approach in Maker (RepeatRunner), which results in lots of likely real gene models not being annotated. The original genome I wrote this for has ~40% repeats, so nearly 10% of the final gene models were not called by Maker due to its repeat-masking approach.
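
As a rough sketch of step 2 (the intervals and names are made up for illustration; this is not funannotate's implementation), the containment check is just the fraction of the gene span covered by masked repeats:

def repeat_fraction(gene, repeats):
    # fraction of the gene span (start, end) covered by masked repeat intervals;
    # assumes the repeat intervals are already merged/non-overlapping
    g_start, g_end = gene
    covered = sum(max(0, min(g_end, r_end) - max(g_start, r_start) + 1)
                  for r_start, r_end in repeats)
    return covered / (g_end - g_start + 1)

# a model sitting entirely inside a masked element would be removed (> 0.90)
print(repeat_fraction((5000, 6000), [(4500, 7000)]))  # 1.0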

MichaelFokinNZ commented 7 years ago

Yes, I've added some extra/optional steps, just because I am describing how I am using funannotate (a bit selfish :). It is for the predict stage only, since annotate is more custom and straightforward.

With the HQ Augustus models - are they tested against transcripts/proteins or also against PASA? So there are two overlapping outputs of Augustus - normal EVM_1 models and HQ EVM_5 models, right? And this EVM_5 weight is not reflected in the EVM_weights.txt file in the predict_misc folder (not the exact filename, I am currently not at my computer)? Does it work the same way if PASA output is provided?

With repeat masking: does it mean that soft(?)-masked sequences are also annotated through the whole pipeline and the gene models inside them are removed only at the last (which?) stage?

And thank you very much for explaining the differences from Maker, which I used before.

nextgenusfs commented 7 years ago

The HiQ models are only used if you provide evidence, i.e. weights with default settings should look something like this:

ABINITIO_PREDICTION Augustus    1
ABINITIO_PREDICTION GeneMark    1
OTHER_PREDICTION    HiQ 5
PROTEIN exonerate   1

If PASA/transdecoder is passed, then the weights would look like:

ABINITIO_PREDICTION Augustus    1
ABINITIO_PREDICTION GeneMark    1
OTHER_PREDICTION    HiQ 5
PROTEIN exonerate   1
OTHER_PREDICTION    transdecoder    10

The in silico predictors deal with the soft-masked genome internally, but yes, the repeats are not hard-masked, so gene models can be predicted in those regions. After EVM runs, the models are then filtered for repeats (as above). Then the models are run through tbl2asn, some errors are automatically parsed, and a few more common mistakes are removed. Then the final GFF/GBK is created. The filtering is challenging: I do my best to automatically parse the NCBI error report, but NCBI changes this often -> for example, tRNAs > 150 bp didn't used to be an error, etc.
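
As one concrete example of the kind of rule that changes, the tRNA length check could be approximated with a small GFF3 scan like the hypothetical sketch below (illustrative file path; the real filtering parses the tbl2asn/NCBI error report rather than the GFF directly):

def long_trnas(gff3_file, max_len=150):
    # yield (seqid, start, end) for tRNA features longer than max_len bp,
    # the kind of model NCBI now flags as an error
    with open(gff3_file) as fh:
        for line in fh:
            if line.startswith('#'):
                continue
            cols = line.rstrip('\n').split('\t')
            if len(cols) == 9 and cols[2] == 'tRNA':
                start, end = int(cols[3]), int(cols[4])
                if end - start + 1 > max_len:
                    yield cols[0], start, end

# for seqid, start, end in long_trnas('annotation.gff3'):  # path is illustrative
#     print(seqid, start, end)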

MichaelFokinNZ commented 7 years ago

Thank you, so now you can just copy-paste posts from this thread to the wiki :) The updated flow chart (not the final design): picture1

nextgenusfs commented 7 years ago

Yes, I know the wiki isn't great.... I need to find some time to write this whole thing up as well as update the wiki/manual. I feel like I'm getting closer to a v1.0 release. With a few more bugs worked out it should be nearly there.

MichaelFokinNZ commented 7 years ago

Any plans to include (if not already there) RNAmmer-like software?

MichaelFokinNZ commented 7 years ago

Oh, BUSCO proteins are missing from my chart - are they used in funannotate predict for anything other than Augustus training in the absence of RNA-seq?

MichaelFokinNZ commented 7 years ago

And one more question - are transposons (and other repetitive elements) preserved in the final annotation? In general I would like to have them there, not for NCBI, but for my own use...

nextgenusfs commented 7 years ago

RNAmmer - most rRNA/rDNA scaffolds should be removed from draft genome assemblies, as the order/copy number/structure/etc. is nearly impossible to assemble correctly with short reads. Perhaps long-read technology will change this, but you should likely be dropping these contigs/scaffolds from the assembly prior to annotating. And RNAmmer is Linux-only....

BUSCO2 is only used to train Augustus in funannotate predict.

If you are interested in repeats/transposable elements I would use an additional package to do a more complete search for repeats than just RepeatModeler/RepeatMasker. @hyphaltip may have some suggestions for finding/annotating repetitive elements. But typically they are left out of annotated NCBI genomes.

hyphaltip commented 7 years ago

REPET is the more comprehensive repeat analysis I would use if you are truly interested in characterizing repeat diversity in fungi. Otherwise, RepeatModeler + RepeatMasker and then the post-prediction cleanup seems a good enough tradeoff of speed and sensitivity to remove the things that contaminate typical studies of protein-coding gene content.

If you do a comparative analysis / ortholog clustering of genomes annotated by different sources, you'll find some TE families listed as genes because of inconsistencies in approaches. I think this points to a need to start from the beginning for each genome and generate this profile with the same methods.

To do it right you would need to do something consistent at the outset for all the genomes -- we are exploring this in the 1000 fungal genomes set as part of an analysis, so we can report back in time.

MichaelFokinNZ commented 7 years ago

@hyphaltip are all the 1000 Fungal Genomes project species present on the JGI website and processed through the JGI annotation pipeline (which is, I believe, different from the chart above, taken from some old source)?

The diversity of sources and quality of public NCBI fungal genomes sometimes drives me mad. That is why, working with a poorly sequenced family, I am trying to do all annotations and assemblies (where possible) myself to get something comparable. I've also found that even assembly cleanup (min contig size 1000 -> 200 bp) can drastically change the number of genes (12k -> 14k). I have no idea at the moment what kind of genes are added.

MichaelFokinNZ commented 7 years ago

@nextgenusfs what about adding some contamination checks? Even something basic like VecScreen could help with a smoother NCBI submission later...

nextgenusfs commented 7 years ago

I view contamination screening etc. as something clearly necessary that needs to be done on any assembly. Only after you have a decent assembly should you use funannotate. In fact, you should probably submit your assembly to NCBI prior to annotation to have them quality check it, as once you start annotating it is difficult to change scaffolds/contigs later on. That way you can also get a locus_tag assigned from NCBI and use that for your annotation with funannotate.

My usual workflow to go from PE Illumina reads to an assembly is JAAWS (Jon's Awesome Assembly Wrapper Script); of course this doesn't always do a good job and is dependent on what sort of data you have. I have been meaning to convert it to Python for portability, but haven't had time yet. But you should be able to see from the bash script what it is doing. Essentially it runs SPAdes, then Blobtools for filtering contamination and high/low coverage contigs, then mito/Euk/Prot/Vec screening, drops contigs < 1 kb, and then runs 5 iterations of Pilon.

MichaelFokinNZ commented 7 years ago

@nextgenusfs what do you think about preserving short contigs (> 200 bp)? I've found that keeping them can strongly increase the number of genes...

MichaelFokinNZ commented 7 years ago

Also, what is the rationale for removing the mitochondrion?

nextgenusfs commented 7 years ago

Any contigs that are shorter than the length of your reads aren't really "contigs", as they aren't assembled into anything and thus have very little support suggesting that they are part of your genome. I think 1 kb is reasonable if you have PE Illumina data with ~300-500 bp inserts. You shouldn't be leaving out any real information.
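
If you want to apply such a length cutoff yourself, a minimal sketch using Biopython (an assumed dependency, with illustrative file names; this is not part of funannotate):

from Bio import SeqIO  # assumed dependency, not part of funannotate

def filter_contigs(in_fasta, out_fasta, min_len=1000):
    # keep only contigs/scaffolds of at least min_len bp
    kept = (rec for rec in SeqIO.parse(in_fasta, "fasta") if len(rec.seq) >= min_len)
    return SeqIO.write(kept, out_fasta, "fasta")

# filter_contigs("assembly.fasta", "assembly.min1kb.fasta", min_len=1000)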

nextgenusfs commented 7 years ago

You can typically assemble the entire mitochondrial genome with PE Illumina, but you can't annotate it the same way, as it uses a different genetic code, etc. You can certainly recover it, but per NCBI WGS submission rules, mitochondrial genomes should be a separate submission.

MichaelFokinNZ commented 7 years ago

Oops... I missed this:

NCBI WGS submission rules, mitochondrial genomes should be a separate submission.

MichaelFokinNZ commented 7 years ago

do you normally compare (BUSCO on assembly) vs (BUSCO on annotated genome)?

nextgenusfs commented 7 years ago

Originally everybody used CEGMA to determine how "complete" their assembly was. BUSCO is now what people use to validate that they have done a decent job on the assembly. Some genomes are not amenable to an easy assembly and thus can be fragmented. But fragmented genomes (given a decent N50 value) can produce a decent draft annotation as well. @hyphaltip and his postdoc @ocisse have developed something similar to BUSCO but geared toward fungi, FGMP, so there are several different ways to assess your assembly prior to full-on annotation.

I see less value in comparing BUSCO in genome mode versus BUSCO in protein mode. funannotate does both, but largely because it is some easy-to-get annotation to add. I also cross-reference the single-copy orthologous proteins with the BUSCO models to draw phylogenetic trees in funannotate compare.

MichaelFokinNZ commented 7 years ago

Thank you very much! Your place in my thesis's Acknowledgements chapter is secured :)

MichaelFokinNZ commented 7 years ago

A couple more questions to clarify.

  1. How can I get the number of transcripts mapped and passed to EVM? Are they only those mapped by both GMAP and BLAT?

  2. On Augustus usage... having transcript/protein and PASA evidence, is it all passed to Augustus (which I need to reflect in the chart), while GeneMark-ET is not supplied with mapped evidence? Right? So the HiQ Augustus models are derived by comparing the ab initio Augustus models with ALL of the evidence above? I feel a bit of logical circularity about using this evidence for gene prediction and later for validation... (of ~13.7k Augustus models, 10.4k became HiQ in my case - is that something expected?)

nextgenusfs commented 7 years ago

The evidence (transcripts/proteins) is currently handled this way:

  1. transcripts are mapped to genome using GMAP -> these mappings are saved in transcript_alignments.gff3 - they are passed directly to EVM later on
  2. transcripts are also aligned using BLAT, these alignments are filtered for quality using pslCDnaFilter
  3. proteins are aligned to genome using exonerate (preliminary regions are identified with either tblastn or diamond)
  4. exonerate alignments are converted to EVM GFF3 format and then passed directly to EVM later on
  5. protein alignments from exonerate and transcript alignments from BLAT are converted to Augustus evidence hints and used during the Augustus de novo prediction. Note that Augustus will be trained in different ways depending on the input (it is only trained if a training set doesn't already exist); the order of preference is: 1) if --rna_bam is passed, Augustus is trained using BRAKER1; 2) if no --rna_bam but --pasa_gff is given, Augustus is trained using the PASA gene models; 3) if neither --rna_bam nor --pasa_gff, then BUSCO2 is run. Note that in BUSCO2-mediated training, the hints from transcripts/exonerate as well as GeneMark-ES models are used in the initial BUSCO2 run, and those results are then used as gene models to train Augustus. This is different from standard BUSCO training.
  6. GeneMark is either run in ES mode (self-training) or in ET mode, the latter only if RNA-seq alignments are passed via a coordinate-sorted BAM file. I could probably run it in ET mode without RNA-seq alignments, but I've found that GeneMark-ES does a pretty good job in self-training mode; note this is not true of Augustus, where training is critical to getting reasonable gene models.
  7. After Augustus is finished, the results are parsed to find gene model predictions that have > 90% exon evidence, using the hints file passed to Augustus, thus it is using exonerate/Blat data to find high quality gene models -> HiQ
  8. Finally, EVM is passed all of the gene models from Augustus and GeneMark (weighted 1), the Augustus HiQ models (weighted 5), the transcript alignments from GMAP (weighted 1), and the protein alignments from exonerate (weighted 1). If PASA/transdecoder gene models are given as input, they are passed directly to EVM (weighted 10).
  9. Critically, EVM is then run with a minimum intron length of 10 (anything shorter will trigger a gene model error at NCBI - this is the main reason that many Maker2 gene models fail at NCBI); see the sketch after this list.
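
To make the intron-length point in step 9 concrete, a minimal sketch (hypothetical helper, not funannotate code) that flags introns shorter than 10 bp from a model's exon coordinates:

def short_introns(exons, min_intron=10):
    # introns are the gaps between consecutive exons; NCBI flags models with
    # introns shorter than min_intron bp as errors
    exons = sorted(exons)
    gaps = [nxt_start - prev_end - 1
            for (_, prev_end), (nxt_start, _) in zip(exons, exons[1:])]
    return [g for g in gaps if g < min_intron]

# an 8 bp gap between two exons would be rejected by NCBI
print(short_introns([(1, 100), (109, 300)]))  # [8]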

You can get a fairly accurate, although perhaps slightly "rough", estimate of the transcripts that mapped by running the following:

#this will count GMAP alignments
grep -c '^###' transcript_alignments.gff3
#to count Blat/pslCDnaFilter alignments
wc -l blat.sort.psl

MichaelFokinNZ commented 7 years ago

Thank you very much! Yes, I did about the same to count transcripts. So BLAT alignments are used only for Augustus and are not passed to EVM? What proportion of Augustus models do you normally expect to be converted into HiQ?

nextgenusfs commented 7 years ago

The BLAT and GMAP alignments are mostly redundant; basically, BLAT is used for Augustus because there is a method for that built into Augustus (essentially it is looking for intron/exon boundaries to train Augustus), while GMAP output integrates better with EVM. The proportion of HiQ Augustus models depends on what you have used for evidence... it could be anywhere from 10% up to nearly 100%. The idea is based on the fact that if Augustus is very well trained it is actually very accurate, so I'm trying to leverage that accuracy to weight Augustus higher than GeneMark etc. when Augustus is well trained. When Augustus is not trained very well, it doesn't do a very good job.

MichaelFokinNZ commented 7 years ago

Is this a more correct representation of the evidence usage?

picture1

nextgenusfs commented 7 years ago

Yeah, it's getting better. Maybe just get rid of the "merge" and put the tRNA predictions underneath Evidence Modeler? And then the two filtering steps post-EVM are 1) filter repeats and length, and then 2) filter tbl2asn-flagged models.

MichaelFokinNZ commented 7 years ago

With the total number of gene models submitted to EVM in funannotate_predict.log... how is it calculated?

For the case below I have ~3x12k from ab initio, ~1k from exonerate, ~75k from GMAP, and ~40k from PASA, which doesn't sum to 77k...

2017-04-20 03:55:42,418: Converting GeneMark GTF file to GFF3
2017-04-20 03:55:43,079: Found 11,378 gene models
2017-04-20 03:55:43,079: perl /home/osboxes/.linuxbrew/opt/evidencemodeler/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl fun_out_v04/predict_misc/genemark.gff
2017-04-20 03:55:44,847: Pulling out high quality Augustus predictions
2017-04-20 03:55:45,169: Found 10482 high quality predictions from Augustus (>90% exon evidence)
**2017-04-20 03:56:06,079: 77,995 total gene models from all sources**
2017-04-20 05:05:21,526: 14,264 total gene models from EVM

nextgenusfs commented 7 years ago

So that is the number of gene models in the gene_predictions.gff3 file; the count is a simple function counting the number of genes. The evidence mappings are not complete gene models, but are used by EVM to weight the intron/exon boundaries, etc. (at least that is how I imagine it works...).

def countGFFgenes(gff_file):
    # count gene features in a GFF3 file: one per line with 'gene' in column 3
    count = 0
    with open(gff_file, 'r') as f:
        for line in f:
            if "\tgene\t" in line:
                count += 1
    return count

Looks like you are on your way to a nice annotation! Lots of good evidence!

MichaelFokinNZ commented 7 years ago

The nice annotation was v03, which was submitted to NCBI; I am already on v05 for in-house use. NCBI is releasing the annotation just now, so fingers crossed. Yes, and the pipeline is much better than Maker, which I used before :)

After the annotation is submitted to NCBI, the protein SeqIDs are somehow fixed, which makes it a bit difficult to update them in bulk. I remember there were a couple of Maker scripts for renaming - do you use something like that?

nextgenusfs commented 7 years ago

The protein IDs will be given XREF numbers or some equivalent by NCBI; however, your locus_tag should never change. Funannotate does use two Maker scripts for renaming the GFF gene models at the end of the predict stage; however, protein IDs are not altered by these scripts. GAG outputs a default name for proteins as well, so there is no real way to customize there. Basically, I end up using the locus_tag for proteins/transcripts/genes when working with the data after the fact. I have some genbank2fasta conversion scripts if you want them; after you get your submission online I can show you where they are (basically in my genome_scripts repository along with a lot of other not very well documented scripts).