ogotoh / spaln

Genome mapping and spliced alignment of cDNA or amino acid sequences
GNU General Public License v2.0
95 stars 16 forks source link

spaln v3 instruction to align protein sequences to a genome #77

Open ohdongha opened 5 days ago

ohdongha commented 5 days ago

Dear Author, Thank you for developing and updating an aligner capable of aligning proteins to a genome on a large scale.

I want to align protein sequences collected from species in the same order (file name aa_1.fa with 25,000 protein sequences) to a genome sequence (file name gnm.gf) so that the spliced alignments can guide the genome annotation process.

I installed the version 3.0.6a <240916> and below is what I tried and the messages printed to the screen:

$ spaln -W -KP gnm.gf
Gd:19306111 No:44658087 My:35802 MS:385 Mb:260 Tw:37711716 Tl:287737303  30.17   0.06  69.78
#Segs 13684, TabSize 64000000, Words: 37711716, GenomeSize 194452999, GIDs 246

$ spaln -Q0 -A0 -O10 -t8 -o gnm_aa_1.sam gnm.bkp aa_1.fa
(... printed some amino acid sequences on the screen ...)

$ column -t -s$'\t' gnm_aa_1.sam
XP_008552390.2   0  gnm.bkp  6241  69   47M3I17M2I7M85H           *  0  0    *  NM:i:64   AS:i:37
XP_011135446.1   0  gnm.bkp  5350  121  55M1D27M                  *  0  0    *  NM:i:75   AS:i:36
XP_026827363.1   0  gnm.bkp  3225  111  44M7I15M2D16M4D47M        *  0  0    *  NM:i:105  AS:i:47
NP_722687.1      0  gnm.bkp  3982  119  13M1I58M5D9M1I31M3I23M    *  0  0    *  NM:i:124  AS:i:37
NP_524299.1      0  gnm.bkp  3757  110  10M1D5M1D25M              *  0  0    *  NM:i:32   AS:i:50
XP_023317772.1   0  gnm.bkp  4948  126  129M1I39M3D40M            *  0  0    *  NM:i:202  AS:i:41
XP_008556624.1   0  gnm.bkp  3199  113  30M8D48M                  *  0  0    *  NM:i:65   AS:i:47
XP_058807296.1   0  gnm.bkp  5057  127  37M3I40M1D39M1D72M1D128M  *  0  0    *  NM:i:311  AS:i:40
NP_001262532.1   0  gnm.bkp  3196  120  41M4D42M                  *  0  0    *  NM:i:75   AS:i:35
XP_015509306.1   0  gnm.bkp  3302  112  7M1I80M                   *  0  0    *  NM:i:73   AS:i:55
NP_001036778.1   0  gnm.bkp  3191  106  15M1I31M                  *  0  0    *  NM:i:36   AS:i:44
XP_047359205.1   0  gnm.bkp  3225  118  34M4I40M5D28M             *  0  0    *  NM:i:94   AS:i:37
XP_048260883.1   0  gnm.bkp  3611  121  18M1I28M2D47M2I14M1I18M   *  0  0    *  NM:i:118  AS:i:38
XP_003707593.1   0  gnm.bkp  3200  111  37M2I46M13I5M7I58M6I13M   *  0  0    *  NM:i:152  AS:i:35

The run finished within a few minutes, and the output was much smaller than I expected.

I have these questions:

Usage: spaln -W[Genome.bkn] -KD [W_Options] Genome.mfa (to write block inf.) spaln -W[Genome.bkp] -KP [W_Options] Genome.mfa (to write block inf.) spaln -W[AAdb.bka] -KA [W_Options] AAdb.faa (to write aa db inf.) spaln -W [Genome.mfa|AAdb.faa] (alternative to makdbs.) spaln [R_options] genomic_segment cDNA.fa (to align) spaln [R_options] genomic_segment protein.fa (to align) spaln [R_options] -dGenome cDNA.fa (to map & align) spaln [R_options] -dGenome protein.fa (to map & align) spaln [R_options] -aAAdb genomic_segment.fa (to search aa database & align) spaln [R_options] -aAAdb protein.fa (to search aa database) (... shortened ...)



Thanks again for the tool and for the help! 
Cheers, 
Dong-Ha
ogotoh commented 2 days ago

Dear Dong-Ha,

Try -Q7 (or -Q4) option insted of -Q4:

$ spaln -Q7 -d gnm -A0 -O10 -t8 -o gnm_aa_1.sam aa_1.fa

Osamu,

ohdongha commented 2 days ago

Try -Q7 (or -Q4) option insted of -Q4:

Thanks a lot, @ogotoh! I missed the detailed instructions about -Q.

However, the SAM output still does not seem to work. When I tried again with -Q7, the SAM header was printed to stdout while "No sam records!" was repeatedly printed to stderr. In the end, the SAM output file (gnm_aa_1.sam below) was empty.

Below, I tried to capture stdout and stderr separately:

$ spaln -Q7 -d gnm -A0 -O10 -t8 -o gnm_aa_1.sam aa_1.fa 1> gnm_aa_1.stdout 2> gnm_aa_1.stderr

$ head gnm_aa_1.stderr
XP_058799706.1  <     0   333 NC_087241.1 10651 10651  14.02   0.00 137 210  6  5 12 10
No sam records !
XP_039314136.1  <     0   112 NC_087229.1  3657  3657  14.10   0.00  32  50  2  2  4  4
No sam records !
XP_034943909.1  >     0   254 NC_087230.1  4714  4714   0.00  13.30 103 140  5  5 10 10
NP_001177534.1  >     0   370 NC_087235.1  7662  7699   7.90   0.00 182 149  9  8 18 16
No sam records !
XP_014235745.1  <     0   683 NC_087229.1  3641  3641   0.00  14.07 316 412 13 12 26 24
No sam records !
No sam records !

$ (head -n5; echo "..."; tail -n5) < gnm_aa_1.stdout | column -t -s$'\t'
@HD  VN:1.3             SO:unsorted
@SQ  SN:NC_087225.1     LN:15008981
@SQ  SN:NC_087226.1     LN:13116906
@SQ  SN:NC_087227.1     LN:12333390
@SQ  SN:NC_087228.1     LN:11548418
...                     
@SQ  SN:NW_026974116.1  LN:7856
@SQ  SN:NW_026974117.1  LN:7739
@SQ  SN:NW_026974118.1  LN:7733
@SQ  SN:NW_026974119.1  LN:7397
@SQ  SN:NW_026974120.1  LN:6325

When I tried other output formats, it worked. For example, the command below printed two GFF and one BED format output files. The BED output (gnm_aa_1.O3) had 16,685 lines with one alignment per line:

spaln -Q7 -d gnm -A0 -O0,2,3 -t8 -pq -o gnm_aa_1 aa_1.fa 

It will be great if the output can be in SAM (or, more preferably, in PAF) due to compatibility with the downstream pipelines.

Thanks again for your help, and please have a look. Cheers, Dong-Ha

ogotoh commented 1 day ago

Dear Dong-Ha,

Sorry, I have forgotten that Spaln does not support SAM output for protein queries, because SAM format is suited for short read mapping, in my opinion. I also do not recommend BED format, because BED format does not discriminate an intron and an ordinary deletion in the query. I guess most people choose GFF3 output.

Osamu,