Closed Secretloong closed 4 years ago
Dear Secretloong,
I could not reproduce your claim on your test pair of sequences. Please let me know more details of your commands. I don’t think that the unidirectional Hirschberg algorithm is responsible for the defect. Genes can be missed when the alignment score doesn’t exceed a given threshold. This most frequently occurs when the reading frame is destroyed for some reasons. You can check this situation by running spaln with –O1 –pw options.
Osamu,
ps. According to the advice of our institute regarding to the corona virus, I come to my office only one or two days per week. Please forgive me if my response may be retarded.
Sorry for this delay.
Here are my commands, I am not sure why their results are different in my computer:
##spaln2.4.03
export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && cd /data/spaln2/URILOM.NOXD01000005.1 && perl /usr/spaln2.4.03/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln2/URILOM.NOXD01000005.1/URILOM.mfa
export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.4.03/table && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log
##spaln2.4.03
export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && cd /data/spaln/URILOM.NOXD01000005.1 && perl /usr/spaln2.3.3c/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln/URILOM.NOXD01000005.1/URILOM.mfa
export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.3.3c/table && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -yS -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log
I also test –O1 –pw options, but they don't work. How could I do other tests to help you find the probelm? Be free to tell me.
PS, everything will be OK, がんばってください~~
Dear Secretloong,
Your commands seems OK. I can't imagine what is wrong, as your test sequences are readily aligned in the local mode: $ spaln [-Q3] test.mfa test.pep
Could you try the following command, and see what happens?
$ spaln -Q4 -O1 -pw -d URILOM -T Tetrapod test.pep
Osamu,
ps. The situation of Tokyo is getting worse. The mayor announced last night that the city might be locked out if the number of infected persons increases more than the present rate.
差出人: Secretloong notifications@github.com 送信日r: 2020年3月24日 11:24 宛先: ogotoh/spaln spaln@noreply.github.com CC: 後藤修 o.gotoh@aist.go.jp; Comment comment@noreply.github.com 件名: Re: [ogotoh/spaln] New algorithm to 2.4.03 missed some gene annotations (#27)
Sorry for this delay.
Here are my commands, I am not sure why their results are different in my computer:
export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && cd /data/spaln2/URILOM.NOXD01000005.1 && perl /usr/spaln2.4.03/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln2/URILOM.NOXD01000005.1/URILOM.mfa
export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.4.03/table && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log
export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && cd /data/spaln/URILOM.NOXD01000005.1 && perl /usr/spaln2.3.3c/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln/URILOM.NOXD01000005.1/URILOM.mfa
export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.3.3c/table && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -yS -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log
I also test CO1 Cpw options, but they don't work. How could I do other tests to help you find the probelm? Be free to tell me.
PS, everything will be OK, がんばってください~~
― You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fogotoh%2Fspaln%2Fissues%2F27%23issuecomment-602970893&data=02%7C01%7Co.gotoh%40aist.go.jp%7C388cdc8107f44cc8dedd08d7cf9a7ef4%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637206134779052897&sdata=ccAudzvs3iASMvlpzM59DegIaeNc9r4xwD3JmMOs0C4%3D&reserved=0, or unsubscribehttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH6C4LUPZ3SSZTQZ64I5ILLRJAKWDANCNFSM4LMAS2JA&data=02%7C01%7Co.gotoh%40aist.go.jp%7C388cdc8107f44cc8dedd08d7cf9a7ef4%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637206134779062891&sdata=Vzj1EA0hppLlDZxTqrgMIzFd1WbF33HiV3zoyz2xCkk%3D&reserved=0.
Hi Gotoh,
Sorry for my slowness again.
I found the issue might happen during the index making.
Log for the 2.3.3c, and got the URILOM.bkp
27M
makblk.pl -WURILOM.bkp -XG802338 -Xb34000 URILOM.mfa
spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa
Gd:736891 No:2462760 My:349 MS:0 Mb:29 Tw:1162406 Tl:5554459 23.03 0.01 76.96
#Segs 104, TabSize 3200000, Words: 1162406, GenomeSize 3513771, GIDs 1
Log for the 2.4.03, and got the URILOM.bkp
23M
makblk.pl -WURILOM.bkp -XG802338 -Xb34000 URILOM.mfa
spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa
Gd:1153987 No:735339 My:242 MS:0 Mb:65535 Tw:4029506 Tl:22131642 61.07 0.01 38.92
#Segs 104, TabSize 1889568, Words: 4029506, GenomeSize 3513771, GIDs 1
Could you find anything wrong with it?
Thanks
I am still basically stay at home. Sorry for the delay in response.
I think the block size defined by -Xb34000 seems too large compared with the genome size of 3.5M.
Please try the default setting which is valid since ver2.4.0
$ spaln -W -KP URILOM.mfa
Osamu,
差出人: Qi Fang notifications@github.com 送信日時: 2020年4月17日 12:08 宛先: ogotoh/spaln spaln@noreply.github.com CC: 後藤修 o.gotoh@aist.go.jp; Comment comment@noreply.github.com 件名: Re: [ogotoh/spaln] New algorithm to 2.4.03 missed some gene annotations (#27)
Hi Gotoh,
Sorry for my slowness again.
I found the issue might happen during the index making.
Log for the 2.3.3c, and got the URILOM.bkp 27M
makblk.pl -WURILOM.bkp -XG802338 -Xb34000 URILOM.mfa spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa Gd:736891 No:2462760 My:349 MS:0 Mb:29 Tw:1162406 Tl:5554459 23.03 0.01 76.96
Log for the 2.4.03, and got the URILOM.bkp 23M
makblk.pl -WURILOM.bkp -XG802338 -Xb34000 URILOM.mfa spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa Gd:1153987 No:735339 My:242 MS:0 Mb:65535 Tw:4029506 Tl:22131642 61.07 0.01 38.92
Could you find anything wrong with it?
Thanks
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fogotoh%2Fspaln%2Fissues%2F27%23issuecomment-615012958&data=02%7C01%7Co.gotoh%40aist.go.jp%7Cc6c98b9095f74068af9308d7e27c9422%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637226897419742135&sdata=2GY0yQVMAumJ%2Bg7V7CLtiYZzEKU8eRMfmjj2FzvdX30%3D&reserved=0, or unsubscribehttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH6C4LQGLNZYKIHPDKAM5FDRM7B2BANCNFSM4LMAS2JA&data=02%7C01%7Co.gotoh%40aist.go.jp%7Cc6c98b9095f74068af9308d7e27c9422%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637226897419742135&sdata=2zubGXuaQ5f%2BJzx%2BvvE%2BdwENQK0B5OGZW32vGvggBoI%3D&reserved=0.
Thank you very much, Osamu.
It works now. The -Xb34000
, which I used to annotated the whole genome before, may be inadequate to the single chromosome annotation. The default setting with inferring from genome size is great and easy to use now. I am planing only use the -XG
to specify the max gene size, then parallel each chromosome or scaffold.
While I am still curious to know why spaln2 3.3c
could find the gene with inadequate block size? Do you know how it happened?
Thanks
When I used -XG802338 -Xb34000
to annotate whole genome (1.2GB), spaln2.4.03
could also indentify the target gene as the spaln2.3.3c
did. By your definition
a block is a predefined segment of the genome with a fixed length. (Gotoh, O. (2008). "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence." Nucleic Acids Res 36(8): 2630-2638.)
I think the blocks would be the same (with the same -Xb
setting) between the whole genome and single chromosome. Then get the same annotations, but now whole genome input would identify the gene which could not be found with single chromosome as input. Do I misunderstand the blocks?
Thanks
Dear Fang,
I think the blocks would be the same (with the same -Xb setting) between the whole genome and single chromosome.
While I am still curious to know why spaln2 3.3c could find the gene with inadequate block size? Do you know how it happened?
No, block size should be adjusted according to the input genomic or chromosomal sequence length. As described in the NAR paper, an intrinsic assumption of Spaln is that the probability of hit of each k-mer to each block should be << 1.0. If the block size is large compared with the input sequence length, that assumption would be violated. It is hard to anticipate what happens in such cases.
—
I am planning only use the -XG to specify the max gene size, then parallel each chromosome or scaffold.
I didn’t expect for spaln to be used in such a way, but instead to be used as either a whole genome mapper (-Q4-7) or a spliced aligner of a short genomic segment that contains one or only a few genes (-Q0-3). I am not sure about your plan, but generally speaking, I recommend to use spaln in the whole genome mapping mode with multi-thread option (-tN). If necessary, you may run several spaln in parallel with a set of queries divided into chunks.
Osamu,
差出人: Qi Fang notifications@github.com 送信日時: 2020年4月26日 16:53 宛先: ogotoh/spaln spaln@noreply.github.com CC: 後藤修 o.gotoh@aist.go.jp; Comment comment@noreply.github.com 件名: Re: [ogotoh/spaln] New algorithm to 2.4.03 missed some gene annotations (#27)
When I used -XG802338 -Xb34000 to annotate whole genome (1.2GB), spaln2.4.03 could also indentify the target gene as the spaln2.3.3c did. By your definition
a block is a predefined segment of the genome with a fixed length. (Gotoh, O. (2008). "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence." Nucleic Acids Res 36(8): 2630-2638.) I think the blocks would be the same (with the same -Xb setting) between the whole genome and single chromosome. Then get the same annotations, but now whole genome input would identify the gene which could not be found with single chromosome as input. Do I misunderstand the blocks?
Thanks
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fogotoh%2Fspaln%2Fissues%2F27%23issuecomment-619504244&data=02%7C01%7Co.gotoh%40aist.go.jp%7C4fea03b98f6a40f966d708d7e9b6ed29%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637234844179713368&sdata=IupxlhCBf1fk8mluhj7txqQ6zU48M6HRFeae9AoIuzo%3D&reserved=0, or unsubscribehttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH6C4LTO5KIKUVUCFZ2DH6DROPR75ANCNFSM4LMAS2JA&data=02%7C01%7Co.gotoh%40aist.go.jp%7C4fea03b98f6a40f966d708d7e9b6ed29%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637234844179713368&sdata=0AFHwzCYY1T%2FG8ndEpboIpIHOs7wHUtJnnqj6lLgLOM%3D&reserved=0.
Many thanks. I really appreciate your answers.
Hi Gotoh,
I found some gene could be annotated by the 2.3.3c, but not by 2.4.03. And these genes have great gene structrue by manual check. I am afraid there is some missing in the unidirectional algorithm. I also attach the test files, so you could confirm it.
test.tar.gz