lgatto / Pbase

Manipluating and exploring protein and proteomics data
8 stars 3 forks source link

.peptidePosition doesn't work for multiple peptides per protein #5

Closed sgibb closed 9 years ago

sgibb commented 10 years ago

The .peptidePosition function (which is not used yet) in peptides-functions.R doesn't find multiple peptides in a protein. Even worse it doesn't find anything if you ask for multiple peptides.

If you look for one peptide per protein everything is fine:

peptides <- c(P1="ABC", P2="DEF")
proteins <- c(P1="ABCD", P2="BCDEF")
.peptidePosition(peptides, proteins)

#IRangesList of length 2
#$P1
#IRanges of length 1
#    start end width
#[1]     1   3     3
#
#$P2
#IRanges of length 1
#    start end width
#[1]     3   5     3
#

But if you want to look for more than one peptides per protein the function fails:

peptides <- c(P1="ABC", P1="DEF")
proteins <- c(P1="ABCDEF", P2="BCDEF")
.peptidePosition(peptides, proteins)
# IRanges of length 0

Because the function isn't used yet this is just a reminder.

sgibb commented 10 years ago

Currently .peptidePosition returns the ranges of the peptide (pattern) in the protein (subject):

pattern <- c(P1="ABC")
subject <- c(P1="ABCDEFABCD")
.peptidePosition(pattern, subject)
# IRangesList of length 1
# $P1
# IRanges of length 2
#     start end width
# [1]     1   3     3
# [2]     7   9     3

I am wondering what the output of the following should be:

pattern <- c(P1="ABC", P1="DEF")
subject <- c(P1="ABCDEFABCD")
.peptidePosition(pattern, subject)

How to distinguish between the two different patterns. Should we return a list of IRangesLists? Or should we use the iranges@elementMetadata to store the index of the pattern? Or do we overcomplicate the problem and it would be easier to provide a non-vectorized function that looks for just one peptide in just one protein?

lgatto commented 10 years ago

How to distinguish between the two different patterns. Should we return a list of IRangesLists? Or should we use the iranges@elementMetadata to store the index of the pattern? Or do we overcomplicate the problem and it would be easier to provide a non-vectorized function that looks for just one peptide in just one protein?

If it was only for 1 single subject, an IRangesList would make sense, but I don't feel comfortable with a list of IRanglesList. I prefer an IRangesList with the pattern's index as an elementMetadata. I think that a non-vectorised form would inevitably lead to repeatedly writing apply(apply(...)) or for loops.

sgibb commented 10 years ago

I hope the problem is fixed now. I will merge the changes into master myself if it works in #6 .

The output of .peptidesPosition is now:

pattern <- c(P1 = "DE", P2 = "IL", P3 = "KR", P1 = "ABC", P3 = "LMN")
subject <- c(P1 = "ABCDE", P3 = "JKLMN", P2 = "FGHIL")
result <- .peptidePosition(pattern, subject)

#IRangesList of length 3
#$P1
#IRanges of length 2
#    start end width
#[1]     4   5     2
#[2]     1   3     3

#$P2
#IRanges of length 1
#    start end width
#[1]     4   5     2

#$P3
#IRanges of length 1
#    start end width
#[1]     3   5     3

lapply(result, mcols) # or mcols(unlist(result))
#$P1
#DataFrame with 2 rows and 2 columns
#  PeptideIndex ProteinIndex
#         <Rle>        <Rle>
#1            1            1
#2            4            1

#$P2
#DataFrame with 1 row and 2 columns
#  PeptideIndex ProteinIndex
#         <Rle>        <Rle>
#1            2            3

#$P3
#DataFrame with 1 row and 2 columns
#  PeptideIndex ProteinIndex
#         <Rle>        <Rle>
#1            5            2
lgatto commented 10 years ago

Yes, looks good. Thanks!

lgatto commented 10 years ago

@sgibb this doesn't look like it has been merged yet. Was there another issue?

sgibb commented 10 years ago

@lgatto to be honest I completely forgot this problem. But there are reasons why it is not merged yet.

  1. .peptidePosition is not needed by any function except the heavyLabels function in #6.
  2. .peptidePosition is slow because calling gregexpr multiple times is slow (it is called for each Protein == each AAString in @aa). But I have no idea to make it faster. Maybe we have to address the speed later.
  3. heavyLabels is not merged yet because I get 5 results (out of 389) that don't match with the results @pavel-shliaha gaves me. I asked him (by adding him to the PR) to verify these 5 results. Maybe I should remind him by an email.
lgatto commented 10 years ago

Ok, no worries.

sgibb commented 9 years ago

The peptides are verified by @pavel-shliaha. Regarding the long run time of .peptidePosition I am going to experiment with stringi which seems to be very fast. Do you mind a new dependency? (sadly with some system requirements)

lgatto commented 9 years ago

I am not very keen on the system requirement. (1) Have you tried stringr? - I think it's only a set of convenient/cleaned-up wrappers rather that a focus on speed, though. (2) Could you possibly put stringi in Suggests.

Edit - on my laptop, install.packages("stringi") works out of the box, but it would be good to do some testing on other platforms and document the dependency and installation in a README file.

sgibb commented 9 years ago

I think I solved the speed problem now. In 6571952ba56c5e8d86efa284a7f0a34d09ace6a3 I just call as.character outside the for loop and get a massive improvement in speed. According to Rprof the slow part was the S4 dispatching that takes nearly 98% of the whole run time (the most expensive calls were standardGeneric, callNextMethod, gregexpr and [). Now .peptidePositions is fast. It solves @pavel-shliaha's test file for the heavy labeled peptides in just a second (before it takes nearly 20 seconds).

P.S.: The stringi::stri_locate_all_fixed implementation is 1.02 times faster than the current implementation. IMHO this difference is not enough add another dependency yet (stringr::str_locate_all is nearly 5 times slower).