ogotoh / spaln

Genome mapping and spliced alignment of cDNA or amino acid sequences
GNU General Public License v2.0

About the output by using -O1 #22

Closed ljl000 closed 5 years ago

ljl000 commented 5 years ago

Hi ogotoh, I have a question about using spaln to align my protein sequences to genome sequences with the -O1 option. The result looks like this:

[screenshot 2019-09-04 1:09:22 PM]

So I want to know whether M(=)50 means the number of identities between the query and the genome sequences. Also, why does the alignment shown seem to be only just good enough? Thanks a lot!

ogotoh commented 5 years ago

Yes, M means the identities between the query and the genome sequences. You may simply count the number of columns in each alignment block for which the translated residue in the first line ('J' should be read as 'S') is identical to the query residue in the third line. I made a small mistake in my previous response: actually, P = 100 M / (M + N + U) rather than P = 100 M / (M + N + G). So you might get the impression that P is smaller than you would intuitively expect from the alignment when it contains several long gaps.

Osamu,
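
As a rough illustration of the counting rule and the formula above, the Python sketch below counts identical columns between a translated line and a query line, reading 'J' as 'S', and then evaluates P = 100 M / (M + N + U). The function names and the example strings are hypothetical. It is also an assumption here that N is the number of mismatched columns, that U is the number of unpaired (gapped) positions, and that gaps are written as '-'; the thread does not define these.

```python
def count_identities(translated: str, query: str) -> int:
    """Count alignment columns where the translated residue equals the query residue.

    'J' in the translated line is read as 'S', as described in the reply above.
    Non-letter characters (assumed here to be gap symbols) never count as matches.
    """
    if len(translated) != len(query):
        raise ValueError("alignment lines must have the same length")
    matches = 0
    for t, q in zip(translated, query):
        if t == "J":
            t = "S"
        if t.isalpha() and t == q:
            matches += 1
    return matches


def percent_identity(m: int, n: int, u: int) -> float:
    """Return P = 100 * M / (M + N + U)."""
    total = m + n + u
    if total == 0:
        raise ValueError("M + N + U must be positive")
    return 100.0 * m / total


# Hypothetical alignment block: first line (translated genome), third line (query).
translated_line = "MKJLV-A"
query_line = "MKSIVGA"
m = count_identities(translated_line, query_line)
# One mismatched column (L/I) and one unpaired column (-/G) in this toy block.
p = percent_identity(m, n=1, u=1)
```

Because U sits in the denominator, every additional unpaired position lowers P even when the aligned columns themselves look mostly identical, which matches the remark about long gaps.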

ljl000 commented 5 years ago

I know that the (=) number means the identical residues between the query and the genome sequences, like this: [screenshot 2019-09-05 12:26:29 PM]. But my question is how to interpret the residues in the first line that are not the same as the query residues in the third line, like this:
[screenshot 2019-09-05 12:30:43 PM]. I mean, what do these pairs of residues mean? They can't be explained simply as identical residues. Can I treat them as positive matches, as in tblastn? I look forward to your comment. Thank you again!

ogotoh commented 5 years ago

Remember that blast (including tblastn) is a local alignment tool, whereas spaln generates a semi-global alignment (unless you set the -LS option), which means that it tries to align all the query residues to a specific range of the genomic sequence. If the query is not the direct product of the gene in the genomic sequence but a product of a homologous (paralogous or orthologous) gene of the same or another species, it is quite common to observe such mismatched pairs in the alignment. Is this the answer you expected, or do you want to ask something else?

Osamu
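
To make the local versus semi-global distinction above concrete, here is a toy Python sketch. It is not spaln's or blast's algorithm: it uses a plain dynamic program with hypothetical scores (match +1, mismatch -1, linear gap -2) and ignores introns, translation, and affine gaps. It only shows why forcing every query residue into the alignment produces mismatched pairs that a local aligner would simply leave out.

```python
def align_score(query: str, target: str, mode: str,
                match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of `query` against `target` under a toy scoring scheme.

    mode="local":      any stretch of the query may align (Smith-Waterman style).
    mode="semiglobal": the whole query must be aligned, while leading and
                       trailing target residues are free.
    """
    if mode not in ("local", "semiglobal"):
        raise ValueError("mode must be 'local' or 'semiglobal'")
    cols = len(target) + 1
    prev = [0] * cols  # row 0: the alignment may start anywhere in the target
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * cols
        # In semi-global mode, query residues placed before the target cost gaps.
        cur[0] = 0 if mode == "local" else gap * i
        for j in range(1, cols):
            s = match if query[i - 1] == target[j - 1] else mismatch
            v = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            if mode == "local":
                v = max(v, 0)  # a local alignment may restart at any cell
                best = max(best, v)
            cur[j] = v
        prev = cur
    # Semi-global: the query is fully consumed, the alignment may end anywhere in the target.
    return best if mode == "local" else max(prev)


# Hypothetical sequences: only the core "ACDEF" is shared.
query_seq = "WWACDEFWW"
target_seq = "XXACDEFXX"
local_score = align_score(query_seq, target_seq, "local")
semiglobal_score = align_score(query_seq, target_seq, "semiglobal")
```

In this toy case the local score covers only the shared core, while the semi-global score also pays for the four flanking W/X pairs, which is the kind of mismatched column seen in the alignment blocks.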

ljl000 commented 5 years ago

Yes, my query is indeed not the direct product of the gene in the genomic sequence but a homologous product from another species. So, for this kind of alignment, do you recommend setting the -LS option? Or do you have any other advice?

ogotoh commented 5 years ago

The alignment shown above looks fine, suggesting that spaln captured the correct gene structure. However, this gene appears to be intronless and so is relatively easy to predict. I usually use the -LS option for mapping cDNA (EST in particular) sequences but rarely use it for mapping protein sequences.

ljl000 commented 5 years ago

I pasted only a good result above; the other results often do not look good enough, because they are frequently fragmented. How can I improve the mapping of homologous protein sequences to genomic sequences? Is there any advice you can offer?

ogotoh commented 5 years ago

Simply put, your reference protein sequences and the target genomic sequence appear to be too remote to be faithfully aligned with spaln. Please refer to my original paper in Bioinformatics (2008) to get a rough estimate of the limitations of spaln, although the limit varies considerably with genome size, sequence quality, intron density, and so on. Although I have not examined it extensively myself, one potential solution is to use spaln in protein database search mode:

    spaln -Q7 -a prodb [-M_N_] [other options] genomic_segment

where prodb might be SwissProt or another protein sequence database pre-formatted with makeidx.pl -a, and genomic_segment is a segment of your genome that may encode one or several genes.

ljl000 commented 5 years ago

Well, I think you may have misunderstood my question: my queries are protein sequences and my reference sequences are the genomic sequence data. So, when you suggest running spaln -Q7 -a prodb [-M_N_] [other options] genomic_segment, do you actually mean spaln -Q7 -d xxxgnm protein_sequence? Am I right?

ogotoh commented 5 years ago

Sorry for the delay in responding. I was away from my office until this morning.

I don't know your exact situation, so I suggested a possible alternative way to use spaln.

I guess you are trying to solve a difficult gene annotation problem in which no close transcript reference sequences are available. What protein sequences are you using as the references? If you find good tblastn hits but fail to find good spaln hits, you may use the alignment-only mode of spaln:

    spaln -Q[0-3] -d your_genome -O1 -T table '$chromosome/contig_id from to [<]' reference_aa

where from and to refer to the range of the potential gene region on the chromosome, and the optional '<' means that the gene resides on the reverse strand. However, when the reference and the target genome are evolutionarily distant, reliable gene structure prediction is difficult, as I said the other day.
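
The region argument of the alignment-only command above could be assembled from tblastn hit coordinates with something like the following Python sketch. The helper names, the example identifiers (chr1, your_genome, reference_aa.fa) and the choice of -Q0 out of the -Q[0-3] range are hypothetical. The exact layout of the region string ('$' immediately followed by the chromosome or contig id, then from, to, and an optional '<') is a literal reading of the template in the reply, not something checked against the spaln manual. The -T table option from the template is left out because its value is not specified in the thread.

```python
import shlex


def region_spec(seq_id: str, start: int, end: int, reverse: bool = False) -> str:
    """Build the quoted region argument: '$id from to [<]' (assumed layout)."""
    spec = f"${seq_id} {start} {end}"
    if reverse:
        spec += " <"  # '<' marks a gene on the reverse strand, per the reply above
    return spec


def alignment_only_command(genome_db: str, seq_id: str, start: int, end: int,
                           reference_aa: str, reverse: bool = False) -> list[str]:
    """Return an argument list for the alignment-only invocation (not executed here)."""
    return [
        "spaln", "-Q0",          # any of -Q0 to -Q3 according to the reply
        "-d", genome_db,
        "-O1",
        region_spec(seq_id, start, end, reverse),
        reference_aa,
    ]


# Hypothetical example: a tblastn hit on the reverse strand of chr1.
cmd = alignment_only_command("your_genome", "chr1", 120000, 135000,
                             "reference_aa.fa", reverse=True)
# shlex.join quotes the region so that the shell does not expand '$' or '<'.
printable = shlex.join(cmd)
```

Passing the list to subprocess.run avoids shell quoting problems altogether; the shlex.join form is only for pasting into a terminal.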



ljl000 commented 5 years ago

Thanks for your comment. Indeed, my situation is not so easy to describe, but you have really helped me a lot with using spaln to solve my problems. I've sorted it out. Thanks again!