nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Failed exonerate alignments found #179

Closed: ignadb closed this issue 5 years ago

ignadb commented 6 years ago

Hi Jon,

Thanks for funannotate and for your continuing support!

I was running version 1.1.1 to annotate a fungal genome. It went well, with beautiful results, but I got one warning on which I would like your opinion. I have listed the command I used and the funannotate log below. At [08:05:27 AM], it reported that exonerate had finished and then immediately warned that failed exonerate alignments were found. Do you think the generated results are still okay? Or is there anything I should be aware of?

Thank you very much; I look forward to hearing from you.

------------------------------------------------------
server@P910:~/software/funanno2018/funannotate$ ./funannotate.py predict -i Hscutula/final.GCA_001399465.1_H_scutula_V1_genomic.fna -o Hscutula/funannotate2018 -s "Hymenoscyphus scutula" --augustus_species HfraxineusKW2v2 --cpus 4
-------------------------------------------------------
[12:09:07 AM]: OS: linux2, 32 cores, ~ 132 GB RAM. Python: 2.7.12
[12:09:07 AM]: Running funannotate v1.1.1
[12:09:08 AM]: Augustus training set for HfraxineusKW2v2 already exists, using existing parameters.
        If you want to re-train, provide a unique name for the --augustus_species argument
[12:09:09 AM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER and BUSCO
[12:09:10 AM]: Loading sequences and soft-masking genome
[12:09:10 AM]: Soft-masking: building RepeatModeler database
[12:09:12 AM]: Soft-masking: generating repeat library using RepeatModeler
[05:24:22 AM]: Soft-masking: running RepeatMasker with custom library
[05:44:57 AM]: Masked genome: 2,396 scaffolds; 62,868,809 bp; 26.66% repeats masked
[05:44:57 AM]: No transcripts available to generate BLAT Augustus hints, provide --transcript_evidence
[05:45:04 AM]: Mapping proteins to genome using Diamond blastx/Exonerate
[05:45:04 AM]: Using 544,324 proteins as queries
[05:45:04 AM]: Running Diamond pre-filter search
[05:53:26 AM]: Found 510,540 preliminary alignments
[08:05:27 AM]: Exonerate finished: found 1,416 alignments
[08:05:27 AM]: Failed exonerate alignments found, see files in p2g_10955/failed
[08:05:28 AM]: Running GeneMark-ES on assembly
[08:48:13 AM]: Converting GeneMark GTF file to GFF3
[08:48:14 AM]: Found 15,097 gene models
[08:48:17 AM]: Running Augustus gene prediction
[09:05:37 AM]: Found 14,316 gene models
[09:05:40 AM]: Pulling out high quality Augustus predictions
[09:05:41 AM]: Found 104 high quality predictions from Augustus (>90% exon evidence)
[09:05:43 AM]: Summary of gene models passed to EVM (weights):
-------------------------------------------------------
Augustus models (1):    14,316 
GeneMark models (1):    15,097
Hi-Q models (5):    118
PASA gene models (10):  0
Other gene models (1):  0
Total gene models:  29,531
-------------------------------------------------------
[09:05:43 AM]: Setting up EVM partitions
[09:06:10 AM]: Generating EVM command list
[09:06:10 AM]: Running EVM commands with 3 CPUs
[09:32:45 AM]: Combining partitioned EVM outputs
[09:32:46 AM]: Converting EVM output to GFF3
[09:36:02 AM]: Collecting all EVM results
[09:36:02 AM]: 15,204 total gene models from EVM
[09:36:02 AM]: Generating protein fasta files from 15,204 EVM models
[09:36:12 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[09:36:35 AM]: Found 308 gene models to remove: 5 too short; 0 span gaps; 408 transposable elements
[09:36:35 AM]: 14,896 gene models remaining
[09:36:35 AM]: Predicting tRNAs
[09:38:23 AM]: Found 155 tRNA gene models
[09:38:23 AM]: 152 tRNAscan models are valid (non-overlapping)
[09:38:23 AM]: Generating GenBank tbl annotation file
[09:38:24 AM]: Converting to final Genbank format
[09:40:51 AM]: Collecting final annotation files for 15,048 total gene models
[09:40:51 AM]: Funannotate predict is finished, output files are in the Hscutula/funannotate2018/predict_results folder
[09:40:51 AM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required): 
funannotate iprscan -i Hscutula/funannotate2018 -m docker -c 4

Run antiSMASH: 
funannotate remote -i Hscutula/funannotate2018 -m antismash -e youremail@server.edu

Annotate Genome: 
funannotate annotate -i Hscutula/funannotate2018 --cpus 4 --sbt yourSBTfile.txt
-------------------------------------------------------
nextgenusfs commented 6 years ago

Hi @ignadb, that "error" isn't something to worry about. It is telling you that at least one exonerate alignment failed; this is usually because of an incompatibility between one of the UniProtKB/Swiss-Prot proteins and exonerate. Everything else looks okay to me. You might improve the annotation by passing some --transcript_evidence. If you don't have RNA-seq, this can also be ESTs from closely related organisms (for fungal species I typically download the clustered EST sequences from JGI MycoCosm for the entire family, cluster them, and then use those as transcript evidence).
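
To check how large the warning is relative to the run, the two log lines quoted in the report can be pulled out programmatically. The following is a minimal sketch, assuming the log wording is exactly as shown in this thread (other funannotate versions may phrase it differently); `summarize_exonerate` is a hypothetical helper, not part of funannotate.

```python
import re

# Patterns match the two log lines quoted in this thread; other funannotate
# versions may word them differently (assumption).
FOUND_RE = re.compile(r"Exonerate finished: found ([\d,]+) alignments")
FAILED_RE = re.compile(r"Failed exonerate alignments found, see files in (\S+)")


def summarize_exonerate(log_text):
    """Return (alignment_count, failed_dir) parsed from funannotate log text.

    Either value is None when the corresponding line is absent.
    """
    found = FOUND_RE.search(log_text)
    failed = FAILED_RE.search(log_text)
    # Strip thousands separators such as "1,416" before converting.
    count = int(found.group(1).replace(",", "")) if found else None
    failed_dir = failed.group(1) if failed else None
    return count, failed_dir


log = """[08:05:27 AM]: Exonerate finished: found 1,416 alignments
[08:05:27 AM]: Failed exonerate alignments found, see files in p2g_10955/failed
"""
print(summarize_exonerate(log))
```

The returned directory is where the log says the failed alignments were written, so it can be inspected by hand if the count of successful alignments looks unexpectedly low.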

ignadb commented 6 years ago

Thanks so much for your quick reply and suggestion! :) At the moment, I am using RNA-Seq data from one of the species I am working with to build gene models for Augustus, and I use those gene models with the other species (without using the --transcript_evidence flag). The good thing is that all the species I am working with are in the same genus. What do you think about assembling the RNA-Seq data de novo and using it with --transcript_evidence for the other species?

nextgenusfs commented 6 years ago

I would not use ab initio gene models (i.e. from Augustus or even a funannotate run) as evidence for another species, as these are predictions and you might end up over-fitting the training parameters. You should get better results by training Augustus for each species based on alignment of real evidence (i.e. well-curated proteins and/or transcripts). Even closely related species can have different Augustus parameters. If you don't have RNA-seq for a species, funannotate defaults to using BUSCO2 predictions and any mapped evidence to train Augustus; typically this works well. You can align the Trinity transcripts from one species to another. The threshold for mapping evidence is a percent identity of 80%, so some transcripts may not map, but the conserved ones likely should (and those are the ones you want for training anyway). You can also pass multiple transcript files at runtime by separating them with a space, e.g. --transcript_evidence trinity.fasta myESTs.fa.
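
The effect of the 80% identity cutoff on cross-species transcripts can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record layout `(transcript_id, matching_bases, aligned_length)` is hypothetical, percent identity is taken as matches divided by aligned length, and the cutoff is treated as inclusive. It does not reproduce how funannotate or its aligners compute identity.

```python
# Hypothetical alignment records: (transcript_id, matching_bases, aligned_length).
# Real aligner output has more fields; this only illustrates the cutoff.
MIN_IDENTITY = 80.0  # percent, the threshold mentioned in the reply above


def percent_identity(matches, aligned_length):
    """Percent of aligned positions that match (one common definition)."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * matches / aligned_length


def passing(alignments, min_identity=MIN_IDENTITY):
    """Return IDs of alignments at or above the identity threshold."""
    return [name for name, matches, length in alignments
            if percent_identity(matches, length) >= min_identity]


example = [("t1", 950, 1000), ("t2", 700, 1000), ("t3", 800, 1000)]
print(passing(example))
```

In this toy input the divergent transcript `t2` (70% identity) is dropped while the more conserved ones are kept, which matches the point that conserved transcripts are the ones that carry over between related species.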

ignadb commented 6 years ago

Sorry for asking so many questions; I just want to make sure I understand. I take it that the Trinity transcripts you mentioned are the de novo-assembled transcripts that I have for one of the species I am working with. Is this correct? If so, in my case it is better to pass the Trinity transcripts to --transcript_evidence and leave --augustus_species unset. Please correct me if I am wrong.

nextgenusfs commented 6 years ago

Yes. When --augustus_species is not defined, the name is generated as a combination of the --species parameter and --isolate or --strain. For example:

-s "Aspergillus fumigatus" --isolate AB1234

would result in the script training Augustus and using aspergillus_fumigatus_AB1234 as the new species name in the Augustus config folder.
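
The naming rule in that example can be sketched as follows. This is only an illustration inferred from the single example in this comment (species lowercased, spaces replaced by underscores, isolate appended unchanged); `default_augustus_name` is a hypothetical function, not funannotate's actual code, and the real tool may normalize names differently.

```python
def default_augustus_name(species, isolate=None):
    """Sketch of the naming described above; not funannotate's actual code."""
    # "Aspergillus fumigatus" -> "aspergillus_fumigatus"
    name = species.strip().lower().replace(" ", "_")
    if isolate:
        # Isolate is appended as given, following the example in this thread.
        name = name + "_" + isolate
    return name


print(default_augustus_name("Aspergillus fumigatus", "AB1234"))
```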

This will then run BUSCO2-mediated training, supplemented with the protein/transcript alignments. You can additionally use your RNA-seq-trained species as a "seed species" for running BUSCO by passing its Augustus training species name to --busco_seed_species. BUSCO2 will then use the --busco_seed_species parameters as its starting point and update them for the new species it is training.
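
Putting the advice in this thread together, a predict call would leave out --augustus_species and pass the transcript files plus the seed species. The sketch below only assembles and prints such a command using flags that appear in this thread; `build_predict_command` is a hypothetical helper, and `genome.fna` and `out_dir` are placeholder paths, not files from the original run.

```python
import shlex


def build_predict_command(genome, outdir, species, transcripts=(),
                          busco_seed=None, cpus=4):
    """Assemble a funannotate predict argument list from flags named in this thread."""
    cmd = ["funannotate", "predict", "-i", genome, "-o", outdir,
           "-s", species, "--cpus", str(cpus)]
    if transcripts:
        # Multiple evidence files follow the flag, separated by spaces.
        cmd.append("--transcript_evidence")
        cmd.extend(transcripts)
    if busco_seed:
        # Existing Augustus species used as the starting point for BUSCO training.
        cmd.extend(["--busco_seed_species", busco_seed])
    return cmd


cmd = build_predict_command(
    "genome.fna", "out_dir", "Hymenoscyphus scutula",  # placeholder paths
    transcripts=["trinity.fasta", "myESTs.fa"],
    busco_seed="HfraxineusKW2v2",
)
# Quote each part so the species name with a space survives copy-pasting.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps the species name as a single argument, and --augustus_species is deliberately omitted so that a new species name is generated and trained.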

ignadb commented 6 years ago

Ah, okay. Thank you very much Jon!

nextgenusfs commented 6 years ago

Note that you can also use the RNA-seq modules in funannotate (funannotate train). You should probably also upgrade to the newest version if you are able to, as there are bug fixes and some better functionality.

ignadb commented 6 years ago

I will talk with our IT support next week and ask them to upgrade to the newest version. :) Thanks so much again Jon! 👍

nextgenusfs commented 5 years ago

v1.4.1 released