Closed TS404 closed 6 years ago
Currently, if the sequence ends with an asterisk (*), it is removed automatically without a warning.
However, when amino acids other than the 20 common ones are present, predict_hyp throws a warning, since k-mers with such letters are not supported. The prediction is returned normally for the other k-mers, and NA is returned for the problematic ones.
So the warning is there to tell the user that the sequences contain odd amino acids.
I am not inclined to change this behavior, except perhaps to give a more informative warning message:
"Characters other than single letter code for amino acids are present. K-mers with non AA symbols will return NA"
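If it helps to see which sequences would trigger the warning before calling predict_hyp, a pre-check along these lines could be used. This is only a sketch in base R: `check_sequence` and `standard_aa` are hypothetical names and are not part of ragp, and the check assumes upper-case one-letter codes.

```r
# Hypothetical helper, not part of ragp: flag sequences that contain
# symbols outside the 20 standard one-letter amino acid codes.
standard_aa <- "ACDEFGHIKLMNPQRSTVWY"

check_sequence <- function(sequence) {
  # Drop a single trailing asterisk, mirroring the behavior described above
  stripped <- sub("\\*$", "", sequence)
  # TRUE where at least one character is not a standard amino acid
  has_odd <- grepl(paste0("[^", standard_aa, "]"), stripped)
  data.frame(sequence = stripped, has_odd = has_odd,
             stringsAsFactors = FALSE)
}

res <- check_sequence(c("MATLRP*", "MABLRP"))
print(res)
```

Sequences flagged with `has_odd = TRUE` are the ones for which some k-mers may come back as NA.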
The default behavior is as follows:
library(ragp)
data(at_nsp)
seq <- at_nsp$sequence[16]
When an asterisk (*) is present at the end of the sequence, it is simply removed:
predict_hyp(sequence = paste0(seq, "*"),
id = "test1")
output:
$prediction
id substr P_pos prob HYP
1 test1 AYCGTGCRSGPCSSSTTPIPP 63 0.406462789 Yes
2 test1 RSGPCSSSTTPIPPTPSGGAG 70 0.958923936 Yes
3 test1 GPCSSSTTPIPPTPSGGAGGL 72 0.691760421 Yes
4 test1 PCSSSTTPIPPTPSGGAGGLN 73 0.891185045 Yes
5 test1 SSSTTPIPPTPSGGAGGLNAD 75 0.958876491 Yes
6 test1 SGGAGGLNADPRDTIENVVTP 86 0.029517453 No
7 test1 PRDTIENVVTPAFFDGIMSKV 96 0.017849574 No
8 test1 GIMSKVGNGCPAKGFYTRQAF 111 0.016302699 No
9 test1 EEIARGKYCSPSTAYPCTPGK 168 0.022101192 No
10 test1 GKYCSPSTAYPCTPGKDYYGR 173 0.023860911 No
11 test1 CSPSTAYPCTPGKDYYGRGPI 176 0.020548651 No
12 test1 TPGKDYYGRGPIQITWNYNYG 185 0.033196110 No
13 test1 YGAAGKFLGLPLLTDPDMVAR 204 0.017053660 No
14 test1 KFLGLPLLTDPDMVARSPQVA 209 0.017078023 No
15 test1 LTDPDMVARSPQVAFQCAMWF 216 0.017504448 No
16 test1 AMWFWNLNVRPVLDQGFGATT 233 0.016318750 No
17 test1 INGGECNGRRPAAVQSRVNYY 256 0.037374441 No
18 test1 RTLGITPGANLSC 277 0.002667858 No
$sequence
sequence
1 MATLRAMLKNAFILFLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGOCSSSTTOIOOTOSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC
id
1 test1
When there are symbols that do not correspond to the 20 amino acids:
substr(seq, 60, 60) <- "B"
predict_hyp(sequence = paste0(seq, "*"),
id = "test1")
output:
$prediction
id substr P_pos prob HYP
1 test1 AYCGTGCBSGPCSSSTTPIPP 63 NA <NA>
2 test1 BSGPCSSSTTPIPPTPSGGAG 70 NA <NA>
3 test1 GPCSSSTTPIPPTPSGGAGGL 72 0.691760421 Yes
4 test1 PCSSSTTPIPPTPSGGAGGLN 73 0.891185045 Yes
5 test1 SSSTTPIPPTPSGGAGGLNAD 75 0.958876491 Yes
6 test1 SGGAGGLNADPRDTIENVVTP 86 0.029517453 No
7 test1 PRDTIENVVTPAFFDGIMSKV 96 0.017849574 No
8 test1 GIMSKVGNGCPAKGFYTRQAF 111 0.016302699 No
9 test1 EEIARGKYCSPSTAYPCTPGK 168 0.022101192 No
10 test1 GKYCSPSTAYPCTPGKDYYGR 173 0.023860911 No
11 test1 CSPSTAYPCTPGKDYYGRGPI 176 0.020548651 No
12 test1 TPGKDYYGRGPIQITWNYNYG 185 0.033196110 No
13 test1 YGAAGKFLGLPLLTDPDMVAR 204 0.017053660 No
14 test1 KFLGLPLLTDPDMVARSPQVA 209 0.017078023 No
15 test1 LTDPDMVARSPQVAFQCAMWF 216 0.017504448 No
16 test1 AMWFWNLNVRPVLDQGFGATT 233 0.016318750 No
17 test1 INGGECNGRRPAAVQSRVNYY 256 0.037374441 No
18 test1 RTLGITPGANLSC 277 0.002667858 No
$sequence
sequence id
1 <NA> test1
Warning message:
In FUN(X[[i]], ...) :
characters other than single letter code for amino acids are present
Only the k-mers that contain the odd symbol are skipped. The other predictions are returned in the prediction element, while the sequence element returns NA, and a warning is issued.
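To make the k-mer rule concrete, here is a rough sketch of which prolines would be affected by an odd symbol. It assumes a 21-residue window (10 residues on each side of the proline), which is inferred from the substr column in the output above rather than taken from the ragp source; `affected_prolines` is a hypothetical name.

```r
# Sketch only: assumes a window of 10 residues on each side of a proline,
# inferred from the output shown in this thread, not from the ragp source.
affected_prolines <- function(sequence, half_width = 10) {
  chars <- strsplit(sequence, "")[[1]]
  standard <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  p_pos <- which(chars == "P")              # positions of all prolines
  odd_pos <- which(!chars %in% standard)    # positions of odd symbols
  # A proline is affected when any odd symbol falls inside its window
  affected <- vapply(
    p_pos,
    function(p) any(abs(odd_pos - p) <= half_width),
    logical(1)
  )
  data.frame(P_pos = p_pos, affected = affected)
}

# Toy sequence: P at 11, B at 17, P at 28
toy <- paste0(strrep("A", 10), "P", strrep("A", 5), "B", strrep("A", 10), "P")
res <- affected_prolines(toy)
print(res)
```

Under this assumption, a B at position 60 affects the prolines at 63 and 70 but not the one at 72, which agrees with the NA rows in the output above.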
When the odd symbol is not in any k-mer:
seq <- at_nsp$sequence[16]
substr(seq, 15, 15) <- "B"
it is simply ignored (no warning) and the output is as if it were not there:
$prediction
id substr P_pos prob HYP
1 test1 AYCGTGCRSGPCSSSTTPIPP 63 0.406462789 Yes
2 test1 RSGPCSSSTTPIPPTPSGGAG 70 0.958923936 Yes
3 test1 GPCSSSTTPIPPTPSGGAGGL 72 0.691760421 Yes
4 test1 PCSSSTTPIPPTPSGGAGGLN 73 0.891185045 Yes
5 test1 SSSTTPIPPTPSGGAGGLNAD 75 0.958876491 Yes
6 test1 SGGAGGLNADPRDTIENVVTP 86 0.029517453 No
7 test1 PRDTIENVVTPAFFDGIMSKV 96 0.017849574 No
8 test1 GIMSKVGNGCPAKGFYTRQAF 111 0.016302699 No
9 test1 EEIARGKYCSPSTAYPCTPGK 168 0.022101192 No
10 test1 GKYCSPSTAYPCTPGKDYYGR 173 0.023860911 No
11 test1 CSPSTAYPCTPGKDYYGRGPI 176 0.020548651 No
12 test1 TPGKDYYGRGPIQITWNYNYG 185 0.033196110 No
13 test1 YGAAGKFLGLPLLTDPDMVAR 204 0.017053660 No
14 test1 KFLGLPLLTDPDMVARSPQVA 209 0.017078023 No
15 test1 LTDPDMVARSPQVAFQCAMWF 216 0.017504448 No
16 test1 AMWFWNLNVRPVLDQGFGATT 233 0.016318750 No
17 test1 INGGECNGRRPAAVQSRVNYY 256 0.037374441 No
18 test1 RTLGITPGANLSC 277 0.002667858 No
$sequence
sequence
1 MATLRAMLKNAFILBLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGOCSSSTTOIOOTOSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC
id
1 test1
How do you think this behavior can be improved?
Closing this issue.
If you think the behavior of predict_hyp can be improved with regard to non-AA symbols, please post another issue.
Providing predictions for k-mers with arbitrary (non-AA) symbols currently seems a futile task.
Functions should either tolerate stop codons or strip them out. Currently predict_hyp gives the error: