ogotoh / spaln

Genome mapping and spliced alignment of cDNA or amino acid sequences
GNU General Public License v2.0

Parameter optimization for non-model species (max gene size and so on) #16

Open Secretloong opened 5 years ago

Secretloong commented 5 years ago

Hi Gotoh,

Recently, I have been trying to replace genewise with spaln. I found that spaln recovers gene structures closer to those expected from the query protein. It seems great! To apply it to non-model species without a reference genome, I think more specific parameter settings would help. I have several questions and would appreciate your help:

  1. How should I determine "-XG# Maximum expected gene size"? Is it the mRNA length or the gene length (including introns)? For an unannotated genome, how can I get a good estimate?
  2. -Q#: I think I need to use 4-7 for whole-genome annotation and 0-3 for partial genome annotation (genomic segments). Am I right?
  3. For "-T$ Subdirectory where species-specific parameters reside", can I train my own species-specific parameters? Do you have any information about training?
  4. I noticed that "-yx# Penalty for a frame shift (100)" was deleted in version 2.3.3c. Does it still work?
  5. Do you have any suggestions for annotating a new genome using only spaln (considering only the similarity-based annotation method)?

Thank you very much! I look forward to your response.

ogotoh commented 5 years ago

  1. The expected gene size including introns (Expected_Max_Gene_Size) should be specified. When the genome size (Genome_Size) is known, spaln simply estimates Expected_Max_Gene_Size ~= sqrt(Genome_Size). Underestimating Expected_Max_Gene_Size can cause serious problems in predicting long genes, whereas the harmful effects of overestimation are marginal, apart from a slight reduction in overall performance.
  2. Yes, you are right.
  3. 3-1. In our experience, spaln is not highly sensitive to the parameter values. For example, a single parameter set seems fine for all tetrapods from coelacanth to human. In most cases, you can find a suitable species in the list in “SPALN_ROOT/table/gnm2tab”.
     3-2. I use a home-made pipeline to estimate the parameter values when enough transcript (cDNA, EST) sequences are available to yield at least 1000 (preferably more) reliable exon-intron boundaries. Unfortunately, the pipeline is rather messy, and I currently do not intend to make it publicly available. If you upload your genomic and transcript sequences somewhere, I will try to estimate a parameter set for your species.
     3-3. The essential procedures have been published in Iwata and Gotoh (2011) BMC Genomics, 12, 45; see also Gotoh, O. (2018) Bioinformatics, 34 (19) 3258-3264. Note that, by default, spaln does not consider the “intron potential” and “branch point signal” described in the former report.
  4. The -yx# option still works. I did not intend to delete that option; it was a simple mistake in editing the help message.
  5. This is a good question. I am currently trying to extend our previous pipeline (Gotoh et al. BMC Bioinformatics, 15, 189) to whole-genome prediction of peptide-coding genes. Of course, the success of such attempts depends on the availability of a good reference genome and annotation. However, I think it is a challenge worth taking on.

Secretloong commented 5 years ago

Thank you, Gotoh. This is very helpful for understanding Spaln.

ogotoh commented 5 years ago

I made a mistake in 1. Actually, Expected_Max_Gene_Size ~= Const * sqrt(Genome_Size), where Const is empirically determined to be 36.

Secretloong commented 5 years ago

That makes sense. So block size ~= sqrt(Genome_Size) and Expected_Max_Gene_Size ~= Const * sqrt(Genome_Size), where Const is empirically determined to be 36.

ogotoh commented 5 years ago

Yes, that's right.
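
For readers who want to apply the rule of thumb confirmed above, here is a minimal Python sketch of the arithmetic. It assumes the genome size is counted in bases, and the function names (`genome_size_from_fasta`, `expected_max_gene_size`, `block_size`) are hypothetical helpers for illustration, not part of spaln. The only values taken from this thread are the constant 36 and the two square-root relations.

```python
import math

# Empirical constant quoted in this thread.
GENE_SIZE_CONST = 36


def genome_size_from_fasta(lines):
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def expected_max_gene_size(genome_size, const=GENE_SIZE_CONST):
    """Expected_Max_Gene_Size ~= Const * sqrt(Genome_Size)."""
    return round(const * math.sqrt(genome_size))


def block_size(genome_size):
    """Block size ~= sqrt(Genome_Size)."""
    return round(math.sqrt(genome_size))


# Hypothetical 1 Mb assembly, chosen so the arithmetic is easy to check.
genome_size = 1_000_000
print(f"-XG{expected_max_gene_size(genome_size)}")  # prints -XG36000
```

Because overestimation is described above as nearly harmless and underestimation as harmful for long genes, it seems reasonable to round the result up rather than down if you adjust it by hand.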