missuse / ragp

Filter plant hydroxyproline rich glycoproteins
MIT License

handling stop codons #4

Closed TS404 closed 6 years ago

TS404 commented 6 years ago

Functions should either tolerate stop codons or strip them out. Currently predict_hyp gives the following warning:

Warning message:
In FUN(X[[i]], ...) :
  Characters other than single letter code for amino acids are present

missuse commented 6 years ago

Currently if the sequence ends with an asterisk (*) it is removed automatically without a warning.

However, when the sequence contains symbols other than the 20 common amino acids, predict_hyp throws a warning, since k-mers containing such letters are not supported. Predictions are returned normally for all other k-mers, and NA is returned for the problematic ones.

So the warning is there to tell the user that the sequences contain nonstandard amino acid symbols.

I am not inclined to change this behavior, except perhaps to use a more informative warning message:

"Characters other than single letter code for amino acids are present. K-mers with non AA symbols will return NA"
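For anyone who wants to check their input before calling predict_hyp, here is a minimal base R sketch of the two cases described above: removing a trailing asterisk and flagging sequences with symbols outside the 20 common amino acids. The names `standard_aa`, `strip_terminal_stop` and `has_nonstandard` are hypothetical helpers written for illustration. They are not part of ragp and do not claim to reproduce its internal checks.

```r
# Hypothetical helpers for illustration only (not ragp functions); base R only.
standard_aa <- "ACDEFGHIKLMNPQRSTVWY"

# Remove a single asterisk at the very end of each sequence, if present.
strip_terminal_stop <- function(sequence) {
  sub("\\*$", "", sequence)
}

# TRUE for each sequence that contains a symbol outside the 20 common residues.
has_nonstandard <- function(sequence) {
  grepl(paste0("[^", standard_aa, "]"), toupper(sequence))
}

seqs <- c("MATLRAPSP*",  # terminal stop only
          "MATBRAPSP*",  # ambiguous residue B plus terminal stop
          "MAT*RAPSP")   # internal stop
cleaned <- strip_terminal_stop(seqs)
flagged <- has_nonstandard(cleaned)
```

In this sketch only the first toy sequence passes after cleaning. An internal asterisk is treated like any other nonstandard symbol, so it is flagged rather than removed.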

missuse commented 6 years ago

The default behavior is as follows:

library(ragp)
data(at_nsp)
seq <- at_nsp$sequence[16]

When an asterisk (*) is present at the end of the sequence, it is simply removed:

predict_hyp(sequence = paste0(seq, "*"),
                    id = "test1")

output:

$prediction
      id                substr P_pos        prob HYP
1  test1 AYCGTGCRSGPCSSSTTPIPP    63 0.406462789 Yes
2  test1 RSGPCSSSTTPIPPTPSGGAG    70 0.958923936 Yes
3  test1 GPCSSSTTPIPPTPSGGAGGL    72 0.691760421 Yes
4  test1 PCSSSTTPIPPTPSGGAGGLN    73 0.891185045 Yes
5  test1 SSSTTPIPPTPSGGAGGLNAD    75 0.958876491 Yes
6  test1 SGGAGGLNADPRDTIENVVTP    86 0.029517453  No
7  test1 PRDTIENVVTPAFFDGIMSKV    96 0.017849574  No
8  test1 GIMSKVGNGCPAKGFYTRQAF   111 0.016302699  No
9  test1 EEIARGKYCSPSTAYPCTPGK   168 0.022101192  No
10 test1 GKYCSPSTAYPCTPGKDYYGR   173 0.023860911  No
11 test1 CSPSTAYPCTPGKDYYGRGPI   176 0.020548651  No
12 test1 TPGKDYYGRGPIQITWNYNYG   185 0.033196110  No
13 test1 YGAAGKFLGLPLLTDPDMVAR   204 0.017053660  No
14 test1 KFLGLPLLTDPDMVARSPQVA   209 0.017078023  No
15 test1 LTDPDMVARSPQVAFQCAMWF   216 0.017504448  No
16 test1 AMWFWNLNVRPVLDQGFGATT   233 0.016318750  No
17 test1 INGGECNGRRPAAVQSRVNYY   256 0.037374441  No
18 test1         RTLGITPGANLSC   277 0.002667858  No

$sequence
                                                                                                                                                                                                                                                                                     sequence
1 MATLRAMLKNAFILFLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGOCSSSTTOIOOTOSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC
     id
1 test1

When there are symbols that do not correspond to the 20 common amino acids:

substr(seq, 60, 60) <- "B"
predict_hyp(sequence = paste0(seq, "*"),
                    id = "test1")

output:

$prediction
      id                substr P_pos        prob  HYP
1  test1 AYCGTGCBSGPCSSSTTPIPP    63          NA <NA>
2  test1 BSGPCSSSTTPIPPTPSGGAG    70          NA <NA>
3  test1 GPCSSSTTPIPPTPSGGAGGL    72 0.691760421  Yes
4  test1 PCSSSTTPIPPTPSGGAGGLN    73 0.891185045  Yes
5  test1 SSSTTPIPPTPSGGAGGLNAD    75 0.958876491  Yes
6  test1 SGGAGGLNADPRDTIENVVTP    86 0.029517453   No
7  test1 PRDTIENVVTPAFFDGIMSKV    96 0.017849574   No
8  test1 GIMSKVGNGCPAKGFYTRQAF   111 0.016302699   No
9  test1 EEIARGKYCSPSTAYPCTPGK   168 0.022101192   No
10 test1 GKYCSPSTAYPCTPGKDYYGR   173 0.023860911   No
11 test1 CSPSTAYPCTPGKDYYGRGPI   176 0.020548651   No
12 test1 TPGKDYYGRGPIQITWNYNYG   185 0.033196110   No
13 test1 YGAAGKFLGLPLLTDPDMVAR   204 0.017053660   No
14 test1 KFLGLPLLTDPDMVARSPQVA   209 0.017078023   No
15 test1 LTDPDMVARSPQVAFQCAMWF   216 0.017504448   No
16 test1 AMWFWNLNVRPVLDQGFGATT   233 0.016318750   No
17 test1 INGGECNGRRPAAVQSRVNYY   256 0.037374441   No
18 test1         RTLGITPGANLSC   277 0.002667858   No

$sequence
  sequence    id
1     <NA> test1

Warning message:
In FUN(X[[i]], ...) :
  characters other than single letter code for amino acids are present

Only the k-mers that contain the odd symbol are skipped. The other predictions are returned in the prediction element, the sequence element returns NA, and a warning is issued.

When the odd symbol does not fall within any k-mer:

seq <- at_nsp$sequence[16]
substr(seq, 15, 15) <- "B"

it is simply ignored (no warning) and the output is the same as if it were not there:

$prediction
      id                substr P_pos        prob HYP
1  test1 AYCGTGCRSGPCSSSTTPIPP    63 0.406462789 Yes
2  test1 RSGPCSSSTTPIPPTPSGGAG    70 0.958923936 Yes
3  test1 GPCSSSTTPIPPTPSGGAGGL    72 0.691760421 Yes
4  test1 PCSSSTTPIPPTPSGGAGGLN    73 0.891185045 Yes
5  test1 SSSTTPIPPTPSGGAGGLNAD    75 0.958876491 Yes
6  test1 SGGAGGLNADPRDTIENVVTP    86 0.029517453  No
7  test1 PRDTIENVVTPAFFDGIMSKV    96 0.017849574  No
8  test1 GIMSKVGNGCPAKGFYTRQAF   111 0.016302699  No
9  test1 EEIARGKYCSPSTAYPCTPGK   168 0.022101192  No
10 test1 GKYCSPSTAYPCTPGKDYYGR   173 0.023860911  No
11 test1 CSPSTAYPCTPGKDYYGRGPI   176 0.020548651  No
12 test1 TPGKDYYGRGPIQITWNYNYG   185 0.033196110  No
13 test1 YGAAGKFLGLPLLTDPDMVAR   204 0.017053660  No
14 test1 KFLGLPLLTDPDMVARSPQVA   209 0.017078023  No
15 test1 LTDPDMVARSPQVAFQCAMWF   216 0.017504448  No
16 test1 AMWFWNLNVRPVLDQGFGATT   233 0.016318750  No
17 test1 INGGECNGRRPAAVQSRVNYY   256 0.037374441  No
18 test1         RTLGITPGANLSC   277 0.002667858  No

$sequence
                                                                                                                                                                                                                                                                                     sequence
1 MATLRAMLKNAFILBLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGOCSSSTTOIOOTOSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC
     id
1 test1
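The pattern in the outputs above is what you would expect if each prediction uses a window of 10 residues on either side of the proline (the substr column is 21 characters wide for prolines away from the sequence ends). A rough base R sketch of that reasoning follows. `prolines_near` is a hypothetical helper, the window size is an assumption read off the output, and the toy sequence only mimics the relevant positions. None of this is ragp code.

```r
# Hypothetical helper for illustration only (not a ragp function); base R only.
# Assumption: a prediction for a proline uses `half_width` residues on each
# side of it, so a symbol at `bad_pos` affects prolines within that distance.
prolines_near <- function(sequence, bad_pos, half_width = 10) {
  p_pos <- gregexpr("P", sequence, fixed = TRUE)[[1]]
  if (p_pos[1] == -1) return(integer(0))  # sequence has no proline
  p_pos <- as.integer(p_pos)              # drop match attributes
  p_pos[abs(p_pos - bad_pos) <= half_width]
}

# Toy sequence with "B" at position 60 and prolines at 63, 70 and 72.
toy <- paste0(strrep("A", 59), "B", "AA", "P",
              strrep("A", 6), "P", "A", "P", strrep("A", 10))

affected_60 <- prolines_near(toy, 60)  # prolines whose window covers position 60
affected_15 <- prolines_near(toy, 15)  # no proline within 10 residues of position 15
```

Under that assumption, a symbol at position 60 reaches the prolines at 63 and 70 but not the one at 72, and a symbol at position 15 reaches none, which matches the NA rows and the unchanged output shown above.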

How do you think this behavior can be improved?

missuse commented 6 years ago

Closing this issue.

If you think the behavior of predict_hyp can be improved with regard to non-AA symbols, please open another issue.

Providing predictions for k-mers with arbitrary symbols (non AA) currently seems a futile task.