MichaelFokinNZ closed this issue 4 years ago
Is this a correct overall representation of the funannotate_predict pipeline (with some extras)? The design is a draft and I will also somehow highlight the steps wrapped into funannotate... Maybe it would be better to divide the GeneMark and Augustus predictions...
I especially like how the important steps to make the consensus gene models are missing from the JGI chart, or how the different evidence is weighed...
Glad you're taking a stab at this.
Also, your chart should probably be more generic unless this is for your paper; otherwise people will think you need RNA or need 3 RNA-seq libraries...
Some data are optional:
Transdecoder is not needed to make gene calls - ORFs are only considered at the later stage of annotation/prediction
-- in contrast, with MAKER, evidence for exons based on EST and protein alignments is fed into the ab initio prediction for Augustus, which improves their scoring
Might be useful to show the default EVM weights assigned to each of the evidence being input.
I'm unclear what "Merge" means. @nextgenusfs has a transposase filtering step after the proteins are predicted to filter out Tpases - it would be good to clarify the parameters of that and at what stage of the data it is performed.
The assignment of function / description to gene products is also important, the swissprot desc for example and any cleanup performed on that after the fact could be documented as well.
Generally the flow-gram for funannotate predict is accurate. @hyphaltip points out a few things that are important if this were to be a "general" flow gram for funannotate.
To clarify a bit on the evidence usage: funannotate does use the mapped evidence (ESTs, transcripts, proteins) during Augustus (of course only if that evidence is passed). The more evidence you can pass, theoretically the better the annotation should be. The other thing that it does is pull out high quality annotations from Augustus, which are defined as gene models where > 90% of the exons are supported by evidence. These are then passed to EVM and given 5X more weight than the rest of the gene models. The intent here is to tilt the EVM scoring matrix in favor of these high quality predictions. If PASA/Transdecoder gene models are passed they are given an EVM weight of 10, while de novo predictions and EST/protein mapping evidence are all weighted at 1.
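The ">90% of exons supported" rule can be sketched in a few lines. This is a hypothetical illustration, not funannotate's actual code: intervals are simplified to (start, end) tuples on a single contig/strand, and the function names are made up for this example.

```python
# Hypothetical sketch of the ">90% exon support" HiQ test described above.
# Exons and evidence alignments are (start, end) tuples on one contig/strand.

def overlaps(exon, hit):
    """True if an exon interval overlaps an evidence alignment interval."""
    return exon[0] <= hit[1] and hit[0] <= exon[1]

def is_high_quality(exons, evidence, cutoff=0.9):
    """Call a gene model HiQ if more than `cutoff` of its exons are
    supported by at least one evidence alignment."""
    supported = sum(1 for e in exons if any(overlaps(e, h) for h in evidence))
    return supported / len(exons) > cutoff
```

A model with 3 of 4 exons supported (75%) would stay a normal EVM weight-1 prediction, while one with all exons supported would be promoted to the weight-5 HiQ set.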
The transposase/repeat filtering is done in two steps: 1) a protein BLAST search against known transposases from RepBase + the TransposonPSI database, and 2) flagging/removing gene models that are > 90% contained inside a masked repetitive element. This is quite a bit different than what Maker does. I use this method because highly repetitive genomes are overly masked by the approach in Maker (RepeatRunner), which results in lots of likely real gene models not being annotated. The original genome I wrote this for has ~40% repeats, so nearly 10% of the final gene models were not called by Maker due to the repeat masking approach.
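Step 2 of that filter could look roughly like this. This is a sketch, not the actual funannotate implementation, and it assumes the masked repeat intervals have already been merged so they don't overlap each other:

```python
# Sketch of the "> 90% contained inside a masked repeat" filter described
# above. Assumes repeat intervals are merged/non-overlapping (hypothetical).

def repeat_fraction(gene, repeats):
    """Fraction of the gene span (start, end) covered by repeat intervals."""
    gstart, gend = gene
    covered = 0
    for rstart, rend in repeats:
        lo, hi = max(gstart, rstart), min(gend, rend)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / (gend - gstart + 1)

def keep_model(gene, repeats, max_repeat=0.9):
    """Drop the gene model if more than max_repeat of it is masked."""
    return repeat_fraction(gene, repeats) <= max_repeat
```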
Yes, I've added some extras/optional steps, just because I am describing how I am using funannotate (a bit selfish :). It is for the predict stage only, since annotate is more custom and straightforward.
With the HQ Augustus models - are they tested against transcripts/proteins or also PASA? So, there are two overlapping outputs of Augustus - normal EVM_1 models and HQ EVM_5 models, right? And this EVM_5 weight is not reflected in the EVM_weights.txt file in the predict_misc folder (not the exact filename, I am currently not at a computer)? Does it work the same way if PASA output is provided?
With repeat-masking: does it mean that soft(?)-masked sequences are also annotated through the whole pipeline and gene models inside them are removed only at the latest (which?) stage?
And thank you very much for explaining the differences from Maker, which I used before.
The HiQ models are only used if you provide evidence, i.e. weights with default settings should look something like this:
ABINITIO_PREDICTION Augustus 1
ABINITIO_PREDICTION GeneMark 1
OTHER_PREDICTION HiQ 5
PROTEIN exonerate 1
If PASA/transdecoder is passed, then the weights would look like:
ABINITIO_PREDICTION Augustus 1
ABINITIO_PREDICTION GeneMark 1
OTHER_PREDICTION HiQ 5
PROTEIN exonerate 1
OTHER_PREDICTION transdecoder 10
The in silico predictors deal with soft-masked genomes internally, but yes, they are not hard-masked, so gene models can be predicted in those regions. After EVM runs, the models are then filtered for repeats (as above). Then the models are run through tbl2asn, some errors are automatically parsed, and a few more common mistakes are removed. Then the final GFF/GBK is created. The filtering is challenging as I do my best to automatically parse the NCBI error report, but NCBI changes this often -> for example, tRNAs > 150 bp didn't use to be an error, etc.
Thank you, so now you can just copy-paste posts from this thread to wiki :)
The updated flow-chart (not the final design)
Yes I know the wiki isn't great.... need to find some time to write this whole thing up as well as update wiki/manual. I feel like I'm getting closer to a v1.0 release. A few more bugs worked out should be nearly there.
And any plans to include (if not already there) RNAmmer-like software?
Oh, BUSCO proteins are missing from my chart; are they used in funannotate_predict for anything other than Augustus training in the absence of RNA-seq?
And one more question - are transposons (and other repetitive elements) preserved in the final annotation? In general I would like to have them there, not for NCBI, but for my own use...
RNAmmer - most rRNA/rDNA scaffolds should be removed from draft genome assemblies as the order/copy number/structure/etc is nearly impossible to assemble correctly with short reads. Perhaps long read technology will change this, but you likely should be dropping these contigs/scaffolds from the assembly prior to annotating. And RNAmmer is linux only....
BUSCO2 is only used to train Augustus in funannotate predict.
If you are interested in repeats/transposable elements I would use an additional package to do a more complete search for repeats than just RepeatModeler/RepeatMasker. @hyphaltip may have some suggestions for finding/annotating repetitive elements. But typically they are left out of annotated NCBI genomes.
REPET is the more comprehensive repeat analysis I would use if you are truly interested in characterizing the repeat diversity in fungi. Otherwise RepeatModeler + RepeatMasker and then the post-prediction cleanup seems a good enough tradeoff of speed and sensitivity to remove things which are contaminating typical studies of protein-coding gene content.
If you do a comparative analysis / ortholog clustering of genomes annotated by different sources you'll find some TE families that are listed as genes because of inconsistency in approaches. I think this points to a need to start at the beginning for each genome and generate this profile with the same methods.
To do it right would need to do something consistent at the outset for the genomes -- we are exploring this in the 1000 fungal genome set as part of an analysis so can report back in time.
@hyphaltip are all the 1000 Fungal Genomes project species present on the JGI website and processed through the JGI annotation pipeline (which is, I believe, different from the chart above taken from some old source)?
The diversity of sources and quality of public NCBI fungal genomes sometimes drives me mad. That is why, working with a weakly sequenced family, I am trying to do all annotations and assemblies (if possible) myself to get something comparable. I've also found that even assembly cleanup (min contig size 1000 -> 200 bp) can drastically change the number of genes (12k -> 14k). I have no idea at the moment what kind of genes are added.
@nextgenusfs what about adding some contamination checks? Even something basic like VecScreen can help with a further smooth NCBI submission...
I view contamination screening etc. as something clearly necessary, which needs to be done on any assembly. Only after you have a decent assembly should you use funannotate. In fact, you likely should submit your assembly to NCBI prior to annotation to have them quality check it, as once you start annotating it is difficult to change scaffolds/contigs later on. That way you can also get assigned a locus_tag from NCBI and use that for your annotation with funannotate.
My usual assembly workflow to go from PE Illumina is JAAWS (Jon's Awesome Assembly Wrapper Script), of course this doesn't always do a good job and is dependent on what sort of data you have. I have been meaning to convert this to Python for portability, but haven't had time yet. But you should see from the bash script what it is doing. Essentially run Spades, then blobtools for filtering contamination and high/low coverage contigs, then mito/Euk/Prot/Vec screen, drop contigs < 1kb, and then runs 5 iterations of Pilon.
@nextgenusfs what do you think about preserving short contigs (>200 bp)? Because I've found that keeping them can strongly increase the number of genes...
Also, what is the rationale for removing the mitochondrion?
Any contigs that are shorter than the length of your reads aren't really "contigs", as they aren't assembled into anything and thus have very little support suggesting that they are part of your genome. I think 1 kb is reasonable if you have PE Illumina data with ~300-500 bp inserts. You shouldn't be leaving out any real information.
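That length cutoff is easy to apply yourself. Here is a minimal, stdlib-only sketch (a hypothetical helper, not part of funannotate or JAAWS):

```python
# Drop FASTA contigs shorter than a minimum length (illustrative sketch).

def filter_contigs(fasta_text, min_len=1000):
    """Return (name, sequence) pairs for contigs >= min_len bp,
    parsed from FASTA-formatted text."""
    records, name, seq = [], None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # flush the previous record if it passes the cutoff
            if name is not None and len("".join(seq)) >= min_len:
                records.append((name, "".join(seq)))
            name, seq = line[1:].split()[0], []
        else:
            seq.append(line.strip())
    if name is not None and len("".join(seq)) >= min_len:
        records.append((name, "".join(seq)))
    return records
```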
You can typically assemble the entire mitochondrial genome with PE Illumina, but you can't annotate it the same way as it uses a different genetic code, etc. You can certainly recover it, but per NCBI WGS submission rules, mitochondrial genomes should be a separate submission.
Oops... missed this:

> per NCBI WGS submission rules, mitochondrial genomes should be a separate submission.
do you normally compare (BUSCO on assembly) vs (BUSCO on annotated genome)?
Originally everybody used CEGMA to determine how "complete" their assembly was. BUSCO is now what people use to validate that they have done a decent job on assembly. Some genomes are not amenable to an easy assembly, thus they can be fragmented. But fragmented genomes (given a decent N50 value) can produce a decent draft annotation as well. @hyphaltip and his postdoc @ocisse have developed something similar to BUSCO but geared toward fungi, FGMP, so there are several different ways to assess your assembly prior to full-on annotation.
I see less of a value in comparing the BUSCO in genome mode versus BUSCO in protein mode. funannotate does both, but largely because it is some easy-to-get annotation to add. I also then cross-reference the single-copy orthologous proteins with BUSCO models to draw phylogenetic trees in funannotate compare.
Thank you much! Your place in my thesis's Acknowledgements chapter is secured :)
Couple more questions to clarify.
How can I get the number of transcripts mapped and passed to EVM? Are they only those mapped by both GMAP and BLAT?
With Augustus usage... having transcripts/proteins and PASA evidence, is it all passed to Augustus (which I have to reflect on the chart)? While GeneMark-ET is not supplied with mapped evidence, right? So, the HiQ Augustus models are derived by comparing ab initio Augustus models with ALL the evidence above? I feel a bit of logical circling about using this evidence for gene prediction and later for validation... (of ~13.7k Augustus models, 10.4k became HiQ in my case; is that expected?).
The evidence (transcripts/proteins) is currently handled this way: the transcript_alignments.gff3 file is passed directly to EVM later on (the BLAT alignments are filtered with pslCDnaFilter). For Augustus training: 1) if --rna_bam is passed then it is trained using BRAKER1, 2) if no --rna_bam but --pasa_gff, then Augustus is trained using the PASA gene models, 3) if neither --rna_bam nor --pasa_gff, then BUSCO2 is run. Note that in BUSCO2-mediated training, the hints from transcripts/exonerate are used as well as GeneMark-ES models in the initial BUSCO2 run, which are then used as gene models to train Augustus. This is different than standard BUSCO training.

You can get a fairly accurate, although perhaps slightly "rough", estimation of transcripts that mapped by running the following:
#this will count GMAP alignments
grep -c '^###' transcript_alignments.gff3
#to count Blat/pslCDnaFilter alignments
wc -l blat.sort.psl
Thank you much! Yes, I did about the same to count transcripts. So, the BLAT alignments are used only for Augustus and are not passed to EVM? What proportion of Augustus models do you normally expect to be converted into HiQ?
The BLAT and GMAP alignments are mostly redundant; basically BLAT is used for Augustus because there is a method for that built into Augustus (essentially it is looking for intron/exon boundaries to train Augustus). GMAP output integrates better with EVM. The HiQ Augustus models depend on what you have used for evidence... it could be anywhere from 10% up to nearly 100%. The idea is based on the fact that if Augustus is very well trained, it is actually very accurate, so I'm trying to leverage that accuracy to weight Augustus higher than GeneMark etc. when Augustus is well trained. When Augustus is not trained very well, it doesn't do a very good job.
Is this a more correct representation of the evidence usage?
Yeah getting better. Maybe just get rid of the "merge" and put the tRNA predictions underneath Evidence Modeler? And then the two filtering steps post EVM are 1) filter repeats and length and then 2) filter tbl2asn flagged models.
With the total number of gene models submitted to EVM in funannotate_predict.log... how is it calculated?
For the case below I have ~3x12k from ab initio, ~1k from Exonerate, ~75k from GMAP, and ~40k from PASA, which doesn't sum up to 77k...
2017-04-20 03:55:42,418: Converting GeneMark GTF file to GFF3
2017-04-20 03:55:43,079: Found 11,378 gene models
2017-04-20 03:55:43,079: perl /home/osboxes/.linuxbrew/opt/evidencemodeler/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl fun_out_v04/predict_misc/genemark.gff
2017-04-20 03:55:44,847: Pulling out high quality Augustus predictions
2017-04-20 03:55:45,169: Found 10482 high quality predictions from Augustus (>90% exon evidence)
**2017-04-20 03:56:06,079: 77,995 total gene models from all sources**
2017-04-20 05:05:21,526: 14,264 total gene models from EVM
So that is the number of gene models in the gene_predictions.gff3 file; the count comes from a simple function counting the number of genes. The evidence mappings are not complete gene models, but are used by EVM to weight the intron/exon boundaries, etc (at least that is how I imagine it works...).
def countGFFgenes(input):
    # count lines whose GFF3 feature column (column 3) is "gene"
    count = 0
    with open(input, 'r') as f:
        for line in f:
            if "\tgene\t" in line:
                count += 1
    return count
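As a quick sanity check of that counting logic, here is a standalone snippet on toy data (the function is repeated here so the snippet runs on its own; the GFF3 lines are made up for illustration):

```python
# Check the gene-counting logic on a tiny, hypothetical GFF3 file.
import os
import tempfile

def countGFFgenes(path):
    # count lines whose GFF3 feature column (column 3) is "gene"
    count = 0
    with open(path) as f:
        for line in f:
            if "\tgene\t" in line:
                count += 1
    return count

toy = (
    "ctg1\tEVM\tgene\t1\t900\t.\t+\t.\tID=g1\n"
    "ctg1\tEVM\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1\n"
    "ctg1\tEVM\tgene\t2000\t3000\t.\t-\t.\tID=g2\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".gff3", delete=False) as tmp:
    tmp.write(toy)
    path = tmp.name

print(countGFFgenes(path))  # prints 2 (mRNA lines are not counted)
os.remove(path)
```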
Looks like you are on your way to a nice annotation! Lots of good evidence!
The nice annotation was v03, submitted to NCBI; I am already on v05 for in-house use. NCBI is releasing the anno just now, so fingers crossed. Yes, and the pipeline is much better than Maker, which I used before :)
After having the anno submitted to NCBI, protein SeqIDs are somehow fixed, which makes updating them in bulk a bit difficult. I remember there were a couple of Maker scripts for renaming; do you use something like that?
The protein IDs will be given XREF numbers or some equivalent by NCBI; however, your locus_tag should never change. Funannotate does use two Maker scripts for renaming the GFF gene models at the end of the predict stage; however, protein IDs are not altered by these scripts. GAG outputs a default name for proteins as well, so there's no real way to customize there. Basically, I end up using the locus_tag for proteins/transcripts/genes when working with the data after the fact. I have some genbank2fasta conversion scripts if you want them; after you get your submission online I can show you where they are (basically in my genome_scripts repository, along with a lot of other not very well documented scripts).
Hi, do you maybe have a kind of box chart for the funannotate pipeline? Something like: ![image](https://cloud.githubusercontent.com/assets/7625343/25372754/cb070c92-29ea-11e7-9036-597275aea349.png)