ogotoh / spaln

Genome mapping and spliced alignment of cDNA or amino acid sequences
GNU General Public License v2.0
94 stars 16 forks source link

short gene for spaln output #39

Open fangdm opened 3 years ago

fangdm commented 3 years ago

Hi, many thanks for making and maintaining this great program. I've run spaln for a performance test, using human genome and its own proteins set with different size from 30 to 5000 (amino acids). It's a great tool! It runs quickly and less gene error messages. However, I noticed a problem that short genes (< 80 aa) had no result except for the following report: ENSP00000248150 < 0 67 12 11607 11606 0.00 6.59 24 30 2 2 6 5 ENSP00000480507 > 0 40 12 12387 12388 0.00 0.00 18 11 1 0 3 3 ENSP00000300873 > 0 70 13 19963 19963 10.54 6.68 32 37 2 2 6 6 ENSP00000477297 > 0 50 10 3146 3146 17.06 0.00 35 23 1 1 5 3 ENSP00000412536 > 0 30 13 21852 21852 12.02 6.68 15 5 1 0 3 1 … And I had tried to modify some parameters, for example: -LS, -Xk3, -Xs3 and -yx0, but there is still no any result, so how to solve the problem. Here are my commands: makeidx.pl -ip Hsap.mfa.gz spaln -O0 -Q7 -pw -T homosapi -t12 -M4 -dHsap Hsap_test1.pep > Hsap_test1.spaln.gff3

Thank you.

ogotoh commented 3 years ago

Dear Fangdm,

Thank you for your comment.

Because of its intrinsic algorithmic nature, Spaln is rather poor at mapping short queries; Spaln is tuned to map relatively long queries, so that short queries are often escape from detection. I consider it difficult to run Spaln at the present speed for all range of query lengths. If your queries consist of only short sequences, it might deserve to try either or both of the following options: -XS // examine all blocks with positive block score -Xfn // n (0.1 < n < 0.4) : a factor for controlling threshold (minimum) block score For example, assuming that fasta files of ENSP00000248150, ENSP00000480507, … are present in the current directory:

spaln -O0 -Q7 -pw -T homosapi -t12 -M4 -dHsap -XS -Xf0.2 ENSP00000248150, ENSP00000480507, … > Out.gff3

will retry to map the failed queries. Note that I have not yet fully tested these options. It is expected that these options will considerably slowdown the execution rate. Additionally, the -XS option has been disabled in recent releases. I uploaded today a new release (Ver.2.4.4) that restores this option (formally -S instead of -XS. -S option is now used for another purpose).

In a future release, I would like to make Spaln automatically adjust proper options depending on the query length.

Osamu,

fangdm commented 3 years ago

Hi, thanks for your reply and updates.

I've tried as you suggested with Ver2.4.4, but I get short genes with false coordinates, whether the short genes with short exons and/or long introns caused higher false positives. I've also tried changing -Xf 0.1 to -Xf 0.4, but it actually produced same result (no result when -Xf 0.1). Were there other parameters that I can do to improve the result for short genes?

Meanwhile, I had tried to annotate my de novo genome with spaln, but I cann’t find a close relative species in table/gnm2tab for -T parameter, how to generate my own species-specific subdirectory with the published relative species.

Thank you.

ogotoh commented 3 years ago

Thank you for your rapid response. I was not much surprised with your negative report. Probably more fundamental change in the Spaln’s algorithm would be necessary to properly deal with short queries. For a while, I suggest you to use Tblastn or Blat to find relevant genomic loci, then apply Spaln to the found loci in the alignment only mode (-Q0-3). Spaln -O0 -Q3 -pw -T homosapi -dHsap ‘$chromesome left right’ query

To generate a species-specific parameter set, we need a considerable number (at least 1000) of transcript (cDNA, EST, TSA (Transcriptome Shotgun Assembly)) sequences, together with the genomic sequence, of that species. Then, 1) Run Spaln with the default parameter set with -O12 (binary output). 2) Run Sortgrcd with -O15 -F2 options (collect high-quality splice junction pairs). 3) Run a set of scripts to generate several files that contain splice junction signals, intron length distribution, and if necessary, other information/signals. 4) Store thus generated files in a subdirectory, ‘tmp’. 5) Repeat once more 1-4) using the -T tmp option.

The third part is somewhat messy and I have not enough time to write a manual. If you tell me the public site of the genomic and transcript sequences of the relative species, I will make the parameter set of that species.

Osamu,

fangdm commented 3 years ago

Hi, many thanks for your kind help.

I've tried as you suggested with the commands: Spaln -O0 -Q3 -pw -T homosapi -dHsap ‘$chromesome left right’ query. It kept reporting the errors: ‘$chromesome left right’ is not found!

But I'm very sure there was corresponding chromosomes in my genome sequence. And I used another way that I extracted the sequences near the target gene, and successfully predicted the corresponding genes. For this way, in my project, it also had great difficulties if the prediction with tblastn or blat were not accurate. So, I’m looking forward to your subsequent updates for short genes.

For a species-specific parameter set, for me, it's a complex process that involves a series of steps, and it takes some experience to correct it. I'll trouble you later, HaHa.

Anyway, thanks again for your help!