ogotoh / spaln

Genome mapping and spliced alignment of cDNA or amino acid sequences
GNU General Public License v2.0
94 stars 16 forks source link

New algorithm to 2.4.03 missed some gene annotations #27

Closed Secretloong closed 4 years ago

Secretloong commented 4 years ago

Hi Gotoh,

I found some gene could be annotated by the 2.3.3c, but not by 2.4.03. And these genes have great gene structrue by manual check. I am afraid there is some missing in the unidirectional algorithm. I also attach the test files, so you could confirm it.

test.tar.gz

ogotoh commented 4 years ago

Dear Secretloong,

I could not reproduce your claim on your test pair of sequences. Please let me know more details of your commands. I don’t think that the unidirectional Hirschberg algorithm is responsible for the defect. Genes can be missed when the alignment score doesn’t exceed a given threshold. This most frequently occurs when the reading frame is destroyed for some reasons. You can check this situation by running spaln with –O1 –pw options.

Osamu,

ps. According to the advice of our institute regarding to the corona virus, I come to my office only one or two days per week. Please forgive me if my response may be retarded.

Secretloong commented 4 years ago

Sorry for this delay.

Here are my commands, I am not sure why their results are different in my computer:

##spaln2.4.03
export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && cd /data/spaln2/URILOM.NOXD01000005.1 && perl /usr/spaln2.4.03/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln2/URILOM.NOXD01000005.1/URILOM.mfa
export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.4.03/table && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log

##spaln2.4.03
export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && cd /data/spaln/URILOM.NOXD01000005.1 && perl /usr/spaln2.3.3c/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln/URILOM.NOXD01000005.1/URILOM.mfa
export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.3.3c/table && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -yS -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log

I also test –O1 –pw options, but they don't work. How could I do other tests to help you find the probelm? Be free to tell me.

PS, everything will be OK, がんばってください~~

ogotoh commented 4 years ago

Dear Secretloong,

Your commands seems OK. I can't imagine what is wrong, as your test sequences are readily aligned in the local mode: $ spaln [-Q3] test.mfa test.pep

Could you try the following command, and see what happens?

$ spaln -Q4 -O1 -pw -d URILOM -T Tetrapod test.pep

Osamu,

ps. The situation of Tokyo is getting worse. The mayor announced last night that the city might be locked out if the number of infected persons increases more than the present rate.


差出人: Secretloong notifications@github.com 送信日•r: 2020年3月24日 11:24 宛先: ogotoh/spaln spaln@noreply.github.com CC: 後藤修 o.gotoh@aist.go.jp; Comment comment@noreply.github.com 件名: Re: [ogotoh/spaln] New algorithm to 2.4.03 missed some gene annotations (#27)

Sorry for this delay.

Here are my commands, I am not sure why their results are different in my computer:

spaln2.4.03

export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && cd /data/spaln2/URILOM.NOXD01000005.1 && perl /usr/spaln2.4.03/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln2/URILOM.NOXD01000005.1/URILOM.mfa

export PATH="/usr/spaln2.4.03:/usr/spaln2.4.03/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.4.03/table && export ALN_DBS=/data/spaln2/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln2/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log

spaln2.4.03

export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && cd /data/spaln/URILOM.NOXD01000005.1 && perl /usr/spaln2.3.3c/perl/makeidx.pl -ip -XG802338 -Xb34000 /data/spaln/URILOM.NOXD01000005.1/URILOM.mfa

export PATH="/usr/spaln2.3.3c:/usr/spaln2.3.3c/perl/:${PATH}" && export ALN_TAB=/usr/spaln2.3.3c/table && export ALN_DBS=/data/spaln/URILOM.NOXD01000005.1 && spaln -O0 -Q7 -pw -yS -TTetrapod -M50 -dURILOM /data/ensembl.clean.pep > /data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.gff 2>/data/spaln/URILOM.NOXD01000005.1/ensembl.clean.pep.spaln.log

I also test CO1 Cpw options, but they don't work. How could I do other tests to help you find the probelm? Be free to tell me.

PS, everything will be OK, がんばってください~~

― You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fogotoh%2Fspaln%2Fissues%2F27%23issuecomment-602970893&data=02%7C01%7Co.gotoh%40aist.go.jp%7C388cdc8107f44cc8dedd08d7cf9a7ef4%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637206134779052897&sdata=ccAudzvs3iASMvlpzM59DegIaeNc9r4xwD3JmMOs0C4%3D&reserved=0, or unsubscribehttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH6C4LUPZ3SSZTQZ64I5ILLRJAKWDANCNFSM4LMAS2JA&data=02%7C01%7Co.gotoh%40aist.go.jp%7C388cdc8107f44cc8dedd08d7cf9a7ef4%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637206134779062891&sdata=Vzj1EA0hppLlDZxTqrgMIzFd1WbF33HiV3zoyz2xCkk%3D&reserved=0.

Secretloong commented 4 years ago

Hi Gotoh,

Sorry for my slowness again.

I found the issue might happen during the index making.

Log for the 2.3.3c, and got the URILOM.bkp 27M

makblk.pl -WURILOM.bkp -XG802338 -Xb34000  URILOM.mfa
spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa
Gd:736891 No:2462760 My:349 MS:0 Mb:29 Tw:1162406 Tl:5554459  23.03   0.01  76.96
#Segs 104, TabSize 3200000, Words: 1162406, GenomeSize 3513771, GIDs 1

Log for the 2.4.03, and got the URILOM.bkp 23M

makblk.pl -WURILOM.bkp -XG802338 -Xb34000  URILOM.mfa
spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa
Gd:1153987 No:735339 My:242 MS:0 Mb:65535 Tw:4029506 Tl:22131642  61.07   0.01  38.92
#Segs 104, TabSize 1889568, Words: 4029506, GenomeSize 3513771, GIDs 1

Could you find anything wrong with it?

Thanks

ogotoh commented 4 years ago

I am still basically stay at home. Sorry for the delay in response.

I think the block size defined by -Xb34000 seems too large compared with the genome size of 3.5M.

Please try the default setting which is valid since ver2.4.0

$ spaln -W -KP URILOM.mfa

Osamu,


差出人: Qi Fang notifications@github.com 送信日時: 2020年4月17日 12:08 宛先: ogotoh/spaln spaln@noreply.github.com CC: 後藤修 o.gotoh@aist.go.jp; Comment comment@noreply.github.com 件名: Re: [ogotoh/spaln] New algorithm to 2.4.03 missed some gene annotations (#27)

Hi Gotoh,

Sorry for my slowness again.

I found the issue might happen during the index making.

Log for the 2.3.3c, and got the URILOM.bkp 27M

makblk.pl -WURILOM.bkp -XG802338 -Xb34000 URILOM.mfa spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa Gd:736891 No:2462760 My:349 MS:0 Mb:29 Tw:1162406 Tl:5554459 23.03 0.01 76.96

Segs 104, TabSize 3200000, Words: 1162406, GenomeSize 3513771, GIDs 1

Log for the 2.4.03, and got the URILOM.bkp 23M

makblk.pl -WURILOM.bkp -XG802338 -Xb34000 URILOM.mfa spaln -WURILOM.bkp -Xk5 -Xb34000 -XG802338 -Xg32 -KP -Xs5 URILOM.mfa Gd:1153987 No:735339 My:242 MS:0 Mb:65535 Tw:4029506 Tl:22131642 61.07 0.01 38.92

Segs 104, TabSize 1889568, Words: 4029506, GenomeSize 3513771, GIDs 1

Could you find anything wrong with it?

Thanks

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fogotoh%2Fspaln%2Fissues%2F27%23issuecomment-615012958&data=02%7C01%7Co.gotoh%40aist.go.jp%7Cc6c98b9095f74068af9308d7e27c9422%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637226897419742135&sdata=2GY0yQVMAumJ%2Bg7V7CLtiYZzEKU8eRMfmjj2FzvdX30%3D&reserved=0, or unsubscribehttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH6C4LQGLNZYKIHPDKAM5FDRM7B2BANCNFSM4LMAS2JA&data=02%7C01%7Co.gotoh%40aist.go.jp%7Cc6c98b9095f74068af9308d7e27c9422%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637226897419742135&sdata=2zubGXuaQ5f%2BJzx%2BvvE%2BdwENQK0B5OGZW32vGvggBoI%3D&reserved=0.

Secretloong commented 4 years ago

Thank you very much, Osamu.

It works now. The -Xb34000, which I used to annotated the whole genome before, may be inadequate to the single chromosome annotation. The default setting with inferring from genome size is great and easy to use now. I am planing only use the -XG to specify the max gene size, then parallel each chromosome or scaffold.

While I am still curious to know why spaln2 3.3c could find the gene with inadequate block size? Do you know how it happened?

Thanks

Secretloong commented 4 years ago

When I used -XG802338 -Xb34000 to annotate whole genome (1.2GB), spaln2.4.03 could also indentify the target gene as the spaln2.3.3c did. By your definition

a block is a predefined segment of the genome with a fixed length. (Gotoh, O. (2008). "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence." Nucleic Acids Res 36(8): 2630-2638.)

I think the blocks would be the same (with the same -Xb setting) between the whole genome and single chromosome. Then get the same annotations, but now whole genome input would identify the gene which could not be found with single chromosome as input. Do I misunderstand the blocks?

Thanks

ogotoh commented 4 years ago

Dear Fang,

I think the blocks would be the same (with the same -Xb setting) between the whole genome and single chromosome.

While I am still curious to know why spaln2 3.3c could find the gene with inadequate block size? Do you know how it happened?

No, block size should be adjusted according to the input genomic or chromosomal sequence length. As described in the NAR paper, an intrinsic assumption of Spaln is that the probability of hit of each k-mer to each block should be << 1.0. If the block size is large compared with the input sequence length, that assumption would be violated. It is hard to anticipate what happens in such cases.

I am planning only use the -XG to specify the max gene size, then parallel each chromosome or scaffold.

I didn’t expect for spaln to be used in such a way, but instead to be used as either a whole genome mapper (-Q4-7) or a spliced aligner of a short genomic segment that contains one or only a few genes (-Q0-3). I am not sure about your plan, but generally speaking, I recommend to use spaln in the whole genome mapping mode with multi-thread option (-tN). If necessary, you may run several spaln in parallel with a set of queries divided into chunks.

Osamu,


差出人: Qi Fang notifications@github.com 送信日時: 2020年4月26日 16:53 宛先: ogotoh/spaln spaln@noreply.github.com CC: 後藤修 o.gotoh@aist.go.jp; Comment comment@noreply.github.com 件名: Re: [ogotoh/spaln] New algorithm to 2.4.03 missed some gene annotations (#27)

When I used -XG802338 -Xb34000 to annotate whole genome (1.2GB), spaln2.4.03 could also indentify the target gene as the spaln2.3.3c did. By your definition

a block is a predefined segment of the genome with a fixed length. (Gotoh, O. (2008). "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence." Nucleic Acids Res 36(8): 2630-2638.) I think the blocks would be the same (with the same -Xb setting) between the whole genome and single chromosome. Then get the same annotations, but now whole genome input would identify the gene which could not be found with single chromosome as input. Do I misunderstand the blocks?

Thanks

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fogotoh%2Fspaln%2Fissues%2F27%23issuecomment-619504244&data=02%7C01%7Co.gotoh%40aist.go.jp%7C4fea03b98f6a40f966d708d7e9b6ed29%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637234844179713368&sdata=IupxlhCBf1fk8mluhj7txqQ6zU48M6HRFeae9AoIuzo%3D&reserved=0, or unsubscribehttps://jpn01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH6C4LTO5KIKUVUCFZ2DH6DROPR75ANCNFSM4LMAS2JA&data=02%7C01%7Co.gotoh%40aist.go.jp%7C4fea03b98f6a40f966d708d7e9b6ed29%7C18a7fec8652f409b8369272d9ce80620%7C0%7C0%7C637234844179713368&sdata=0AFHwzCYY1T%2FG8ndEpboIpIHOs7wHUtJnnqj6lLgLOM%3D&reserved=0.

Secretloong commented 4 years ago

Many thanks. I really appreciate your answers.