spaln v3 instruction to align protein sequences to a genome

ohdongha commented 5 days ago

Dear Author, Thank you for developing and updating an aligner capable of aligning proteins to a genome on a large scale.

I want to align protein sequences collected from species in the same order (file name aa_1.fa with 25,000 protein sequences) to a genome sequence (file name gnm.gf) so that the spliced alignments can guide the genome annotation process.

I installed the version 3.0.6a <240916> and below is what I tried and the messages printed to the screen:

$ spaln -W -KP gnm.gf
Gd:19306111 No:44658087 My:35802 MS:385 Mb:260 Tw:37711716 Tl:287737303  30.17   0.06  69.78
#Segs 13684, TabSize 64000000, Words: 37711716, GenomeSize 194452999, GIDs 246

$ spaln -Q0 -A0 -O10 -t8 -o gnm_aa_1.sam gnm.bkp aa_1.fa
(... printed some amino acid sequences on the screen ...)

$ column -t -s$'\t' gnm_aa_1.sam
XP_008552390.2   0  gnm.bkp  6241  69   47M3I17M2I7M85H           *  0  0    *  NM:i:64   AS:i:37
XP_011135446.1   0  gnm.bkp  5350  121  55M1D27M                  *  0  0    *  NM:i:75   AS:i:36
XP_026827363.1   0  gnm.bkp  3225  111  44M7I15M2D16M4D47M        *  0  0    *  NM:i:105  AS:i:47
NP_722687.1      0  gnm.bkp  3982  119  13M1I58M5D9M1I31M3I23M    *  0  0    *  NM:i:124  AS:i:37
NP_524299.1      0  gnm.bkp  3757  110  10M1D5M1D25M              *  0  0    *  NM:i:32   AS:i:50
XP_023317772.1   0  gnm.bkp  4948  126  129M1I39M3D40M            *  0  0    *  NM:i:202  AS:i:41
XP_008556624.1   0  gnm.bkp  3199  113  30M8D48M                  *  0  0    *  NM:i:65   AS:i:47
XP_058807296.1   0  gnm.bkp  5057  127  37M3I40M1D39M1D72M1D128M  *  0  0    *  NM:i:311  AS:i:40
NP_001262532.1   0  gnm.bkp  3196  120  41M4D42M                  *  0  0    *  NM:i:75   AS:i:35
XP_015509306.1   0  gnm.bkp  3302  112  7M1I80M                   *  0  0    *  NM:i:73   AS:i:55
NP_001036778.1   0  gnm.bkp  3191  106  15M1I31M                  *  0  0    *  NM:i:36   AS:i:44
XP_047359205.1   0  gnm.bkp  3225  118  34M4I40M5D28M             *  0  0    *  NM:i:94   AS:i:37
XP_048260883.1   0  gnm.bkp  3611  121  18M1I28M2D47M2I14M1I18M   *  0  0    *  NM:i:118  AS:i:38
XP_003707593.1   0  gnm.bkp  3200  111  37M2I46M13I5M7I58M6I13M   *  0  0    *  NM:i:152  AS:i:35

The run finished within a few minutes, and the output was much smaller than I expected.

I have these questions:

For the purpose I described, are the codes above correct?
The SAM printed the genome sequence IDs as gnm.bkp instead of the chromosome/scaffold names. The mapping position in the 4th column does not seem to be the genome coordinate either. Is there an additional step needed to interpret the SAM output above?
I expected many more alignments, but only a handful was retrieved, and the run was very short. Is there any other setting I would need to add (e.g., related to memory)?
In the help (copied below), there are usages "(to align)" and "(to map & align)" protein.fa to the genome.
- What is the difference between the two?
- I have tried -d gnm, -d gnm.gf, -d gnm.bkp, etc., but they all gave "Can't open query !" What is the correct way to map & align in this case?
```
$ spaln -h
*** SPALN_AVX2 version 3.0.6a <240916> ***
```

Usage: spaln -W[Genome.bkn] -KD [W_Options] Genome.mfa (to write block inf.) spaln -W[Genome.bkp] -KP [W_Options] Genome.mfa (to write block inf.) spaln -W[AAdb.bka] -KA [W_Options] AAdb.faa (to write aa db inf.) spaln -W [Genome.mfa|AAdb.faa] (alternative to makdbs.) spaln [R_options] genomic_segment cDNA.fa (to align) spaln [R_options] genomic_segment protein.fa (to align) spaln [R_options] -dGenome cDNA.fa (to map & align) spaln [R_options] -dGenome protein.fa (to map & align) spaln [R_options] -aAAdb genomic_segment.fa (to search aa database & align) spaln [R_options] -aAAdb protein.fa (to search aa database) (... shortened ...)



Thanks again for the tool and for the help! 
Cheers, 
Dong-Ha

ogotoh commented 2 days ago

Dear Dong-Ha,

Try -Q7 (or -Q4) option insted of -Q4:

$ spaln -Q7 -d gnm -A0 -O10 -t8 -o gnm_aa_1.sam aa_1.fa

Osamu,

ohdongha commented 2 days ago

Try -Q7 (or -Q4) option insted of -Q4:

Thanks a lot, @ogotoh! I missed the detailed instructions about -Q.

However, the SAM output still does not seem to work. When I tried again with -Q7, the SAM header was printed to stdout while "No sam records!" was repeatedly printed to stderr. In the end, the SAM output file (gnm_aa_1.sam below) was empty.

Below, I tried to capture stdout and stderr separately:

$ spaln -Q7 -d gnm -A0 -O10 -t8 -o gnm_aa_1.sam aa_1.fa 1> gnm_aa_1.stdout 2> gnm_aa_1.stderr

$ head gnm_aa_1.stderr
XP_058799706.1  <     0   333 NC_087241.1 10651 10651  14.02   0.00 137 210  6  5 12 10
No sam records !
XP_039314136.1  <     0   112 NC_087229.1  3657  3657  14.10   0.00  32  50  2  2  4  4
No sam records !
XP_034943909.1  >     0   254 NC_087230.1  4714  4714   0.00  13.30 103 140  5  5 10 10
NP_001177534.1  >     0   370 NC_087235.1  7662  7699   7.90   0.00 182 149  9  8 18 16
No sam records !
XP_014235745.1  <     0   683 NC_087229.1  3641  3641   0.00  14.07 316 412 13 12 26 24
No sam records !
No sam records !

$ (head -n5; echo "..."; tail -n5) < gnm_aa_1.stdout | column -t -s$'\t'
@HD  VN:1.3             SO:unsorted
@SQ  SN:NC_087225.1     LN:15008981
@SQ  SN:NC_087226.1     LN:13116906
@SQ  SN:NC_087227.1     LN:12333390
@SQ  SN:NC_087228.1     LN:11548418
...                     
@SQ  SN:NW_026974116.1  LN:7856
@SQ  SN:NW_026974117.1  LN:7739
@SQ  SN:NW_026974118.1  LN:7733
@SQ  SN:NW_026974119.1  LN:7397
@SQ  SN:NW_026974120.1  LN:6325

When I tried other output formats, it worked. For example, the command below printed two GFF and one BED format output files. The BED output (gnm_aa_1.O3) had 16,685 lines with one alignment per line:

spaln -Q7 -d gnm -A0 -O0,2,3 -t8 -pq -o gnm_aa_1 aa_1.fa

It will be great if the output can be in SAM (or, more preferably, in PAF) due to compatibility with the downstream pipelines.

Thanks again for your help, and please have a look. Cheers, Dong-Ha

ogotoh commented 1 day ago

Dear Dong-Ha,

Sorry, I have forgotten that Spaln does not support SAM output for protein queries, because SAM format is suited for short read mapping, in my opinion. I also do not recommend BED format, because BED format does not discriminate an intron and an ordinary deletion in the query. I guess most people choose GFF3 output.

Osamu,

ogotoh / spaln

spaln v3 instruction to align protein sequences to a genome #77