Parameter optimization for non-model species (max gene size and so on)

Secretloong commented 5 years ago

Hi Gotoh,

Recently I have been trying to replace genewise with your spaln. I found that spaln recovers gene structures that match the query protein more closely. It seems great! To apply it to non-model genomes without a reference, I think more specific parameters would be better. I have several questions and need your help:

1. How do I determine "-XG# Maximum expected gene size"? Is it the mRNA length or the gene length (including introns)? For a non-annotated genome, how can I get a better estimate?
2. For -Q#, I think I should use 4-7 for whole-genome annotation and 0-3 for partial-genome annotation (genomic segments). Is that right?
3. For "-T$ Subdirectory where species-specific parameters reside", could I train my own species-specific parameters? Do you have any information about training?
4. I noticed that "-yx# Penalty for a frame shift (100)" was removed in version 2.3.3c. Does the option still work?
5. Do you have any suggestions for annotating a new genome with spaln alone (considering only the similarity-based annotation method)?

Thank you very much! I look forward to your response.

ogotoh commented 5 years ago

1. The expected gene size including introns (Expected_Max_Gene_Size) should be assigned. When the genome size (Genome_Size) is known, spaln simply estimates Expected_Max_Gene_Size ~= sqrt(Genome_Size). Underestimating Expected_Max_Gene_Size can cause serious problems in the prediction of long genes, while the harmful effects of overestimation are marginal, apart from a slight reduction in overall performance.
2. Yes, you are right.
3. On species-specific parameters:
3-1. In our experience, spaln is not highly sensitive to the parameter values. For example, a single parameter set seems fine for all tetrapods from coelacanth to human. In most cases, you can find a suitable species in the list in “SPALN_ROOT/table/gnm2tab”.
3-2. I use a home-made pipeline to estimate the parameter values when enough transcript (cDNA, EST) sequences are available to yield at least 1000 (preferably more) reliable exon-intron boundaries. Unfortunately, the pipeline is rather messy, and I currently do not intend to make it publicly available. If you upload your genomic sequences and transcript sequences somewhere, I will try to estimate a parameter set for you.
3-3. The essential procedures have been published in Iwata and Gotoh (2011) BMC Genomics, 12, 45; see also Gotoh, O. (2018) Bioinformatics, 34 (19) 3258-3264. Note that, by default, spaln does not consider the “intron potential” and “branch point signal” described in the former report.
4. The -yx# option still works. I did not intend to delete that option; it was a simple mistake in editing the help message.
5. This is a good question. I am currently trying to extend our previous pipeline (Gotoh et al. BMC Bioinformatics, 15, 189) to whole-genome peptide-coding gene prediction. Of course, the success of such trials depends on the presence of a good reference genome and annotation. However, I think it is worth taking on as a future challenge.
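
To gauge whether a transcript set reaches the "at least 1000 reliable exon-intron boundaries" mentioned in 3-2, one could count the distinct boundaries implied by transcript-to-genome alignments. The sketch below is not the pipeline described above. It assumes the alignments are available as GFF3 lines with `exon` features carrying a `Parent` attribute, and the function name `count_splice_boundaries` is hypothetical. It does not judge whether a boundary is reliable; that filtering is left to the user.

```python
from collections import defaultdict

MIN_BOUNDARIES = 1000  # threshold quoted in the thread (answer 3-2)


def count_splice_boundaries(gff_lines):
    """Count distinct exon-intron boundaries implied by GFF3 exon features.

    Assumption: each exon line has a Parent=<transcript id> attribute.
    """
    exons = defaultdict(list)  # (seqid, strand, parent) -> [(start, end), ...]
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        exons[(cols[0], cols[6], parent)].append((int(cols[3]), int(cols[4])))

    boundaries = set()
    for (seqid, strand, _parent), spans in exons.items():
        spans.sort()
        # Each pair of adjacent exons flanks one intron, i.e. two boundaries.
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            boundaries.add((seqid, strand, left_end))
            boundaries.add((seqid, strand, right_start))
    return len(boundaries)


if __name__ == "__main__":
    demo = [
        "chr1\tdemo\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
        "chr1\tdemo\texon\t201\t300\t.\t+\t.\tID=e2;Parent=t1",
    ]
    n = count_splice_boundaries(demo)
    print(n, "boundaries;", "enough" if n >= MIN_BOUNDARIES else "too few")
```

Boundaries shared by several transcripts are counted once, so redundant ESTs do not inflate the total.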

Secretloong commented 5 years ago

Thank you, Gotoh. Very helpful to understand Spaln.

ogotoh commented 5 years ago

I made a mistake in 1. Actually, Expected_Max_Gene_Size ~= Const * sqrt(Genome_Size), where Const is empirically determined to be 36.

Secretloong commented 5 years ago

It makes sense. So block size ~= sqrt(Genome_Size), and Expected_Max_Gene_Size ~= Const * sqrt(Genome_Size), where Const is empirically determined to be 36.

ogotoh commented 5 years ago

Yes, that's right.
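
The two estimates confirmed above can be worked through numerically. This is a minimal sketch of the arithmetic only: the function names and the 100 Mbp genome size are made up for illustration, and the rounding is an assumption, since the thread does not say how spaln rounds these values internally.

```python
import math

GENE_SIZE_CONST = 36  # empirical constant quoted in the thread


def estimate_block_size(genome_size: int) -> int:
    """block size ~= sqrt(Genome_Size)"""
    return round(math.sqrt(genome_size))


def estimate_max_gene_size(genome_size: int, const: float = GENE_SIZE_CONST) -> int:
    """Expected_Max_Gene_Size ~= Const * sqrt(Genome_Size)"""
    return round(const * math.sqrt(genome_size))


if __name__ == "__main__":
    genome_size = 100_000_000  # hypothetical 100 Mbp genome
    print("block size:", estimate_block_size(genome_size))        # 10000
    print("max gene size:", estimate_max_gene_size(genome_size))  # 360000
```

Because underestimation hurts long genes far more than overestimation hurts performance, rounding the result up when choosing a value for -XG seems the safer direction.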

ogotoh / spaln

Parameter optimization for non-model species (max gene size and so on) #16