Open ohdongha opened 1 week ago
Dear Dong-Ha,
Such an anomaly can occur at the first or the last exon when a predicted codon does not align with any query residue. You can see what happens by running Spaln with the -O1 option (alignment mode). For your example, the following command (the -T option is added for potentially better performance)
$ spaln -Q7 -d gnm_test -T InsectAp -O1 ' toy_example.fa (2)'
will produce
>NC_087227.1 [1:256000] ( 8151041 - 8407040 ) - >XP_023316908.1 [1:554] ( 1 - 554 )
;C join(8278318..8278415,8278534..8278788,8278895..8279036,
;C 8279114..8279265,8279406..8279580,8279708..8279809,8279939..8280231,
;C 8280308..8280625,8280780..8280923,8282427..8282430)
PAM = 150, BIAS = 0.0, u = 2.0, v = 9.0
Score = 2365.8 (2380.7), 424.0 (=), 130.0 (#), 2.0 (g), 6.0 (u), (75.71 %)
ALIGNMENT 1 / 1
M T L P F H G T E P R K K E E I L V H A
8278318 ATGACACTCCCCTTCCACGGGACTGAGCCGCGAAAAAAAGAGGAAATCCTCGTCCACGCG| NC_087227.1
1 M S T P P H G T E V R K S N E I L E H A | XP_023316908.1
K D F L D Q Y F T S I R R
8278378 AAGGACTTCCTGGACCAATATTTCACGTCCATTCGGAGgtgagttaatcattcagtgaat| NC_087227.1
21 K D F L N Q Y F T S I K R | XP_023316908.1
...
T F G P L J N V R
8280898 ACATTCGGACCGTTGAGTAACGTTAGgttcgctgtatttgctctcggctcatcggcctat| NC_087227.1
546 N F G P L S N V R | XP_023316908.1
;; skip 1440 nt's
O
8282398 ggagagccactctacgtgtttgtccgcagGTAG | NC_087227.1
555 - | XP_023316908.1
Actually, I have no idea how to properly represent this situation in GFF3 format. In the default (-O4) output format, you can see that the genomic sequence does not match any part of the query.
I guess one possibility is that the query amino acid sequence is not full length, lacking a C-terminal region, so that Spaln forcibly finds a nearby termination codon in this example.
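To see how short the final predicted exon is, the segment lengths can be computed from the `;C join(...)` lines of the alignment output above. This is only a minimal sketch and not part of Spaln: `parse_join` is a hypothetical helper, and the coordinates are assumed to be 1-based and inclusive.

```python
import re

def parse_join(text):
    """Extract (start, end) pairs from a 'join(a..b,c..d,...)' string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]

# Coordinates copied from the ';C join(...)' lines of the -O1 output above.
JOIN = (
    "join(8278318..8278415,8278534..8278788,8278895..8279036,"
    "8279114..8279265,8279406..8279580,8279708..8279809,8279939..8280231,"
    "8280308..8280625,8280780..8280923,8282427..8282430)"
)

segments = parse_join(JOIN)
# Assumes inclusive coordinates, so length = end - start + 1.
lengths = [end - start + 1 for start, end in segments]

print(lengths[-1])   # 4: the last segment covers only 4 nt
print(sum(lengths))  # 1683: total predicted coding length (561 codons)
```

The last segment is only 4 nt long and lies beyond the skipped region, which fits the picture of a terminal exon that was added mainly to reach a stop codon.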
I thought that the -LS or -LC (local similarity) option could prevent such anomalies, but the current implementation does not work as expected. I will check the behavior of Spaln when the -LS (or -LC) option is given.
Osamu,
Dear Osamu, @ogotoh
I followed your instructions from issue 77 to successfully map and align protein sequences to a genome. However, the "match format" GFF3 appears to have some glitches.
I downloaded the genome sequence from NCBI:
The protein sequence input (toy_example.fa) was as follows:
Code to run Spaln (version 3.0.6a <240916>):
Then the output toy_example.gff3 includes multiple places where the start and end locations appear swapped:
For example, please see this part where the end (2108443) is smaller than the start (2108444):
And also, the 9th column of the same line (181 > 180):
This pattern happens at the end of each alignment for these sequences.
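A quick way to list every affected row is to scan the GFF3 file for lines whose start exceeds the end, both in columns 4 and 5 and in the `Target` attribute of column 9. The following is a rough sketch, assuming standard tab-separated GFF3 with a `Target=<id> <start> <end>` attribute. The function name `find_inverted_rows` is hypothetical, and the sample row is made up for illustration (only the four coordinates are taken from the report above; it is not actual Spaln output).

```python
import re

def find_inverted_rows(gff3_text):
    """Return (line_no, kind, start, end) for rows whose start exceeds end."""
    problems = []
    for line_no, line in enumerate(gff3_text.splitlines(), start=1):
        # Skip blank lines, pragmas, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        # Columns 4 and 5 hold the genomic start and end.
        start, end = int(cols[3]), int(cols[4])
        if start > end:
            problems.append((line_no, "genomic", start, end))
        # Column 9 may carry 'Target=<id> <start> <end> [strand]'.
        m = re.search(r"Target=(\S+) (\d+) (\d+)", cols[8])
        if m and int(m.group(2)) > int(m.group(3)):
            problems.append((line_no, "target", int(m.group(2)), int(m.group(3))))
    return problems

# Made-up sample row; sequence ID, source, type, and attributes are placeholders.
SAMPLE = "\n".join([
    "##gff-version 3",
    "chr1\tALN\tcDNA_match\t2108444\t2108443\t0\t+\t.\tID=match1;Target=prot1 181 180 +",
])

for problem in find_inverted_rows(SAMPLE):
    print(problem)
```

To check a real file, the same function can be called on its contents, for example `find_inverted_rows(open("toy_example.gff3").read())`.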
Please let me know if you have an issue reproducing this result or anything else. Thanks again!
Cheers, Dong-Ha