lgatto / Pbase

Manipluating and exploring protein and proteomics data
8 stars 3 forks source link

Protein coverage #19

Closed lgatto closed 9 years ago

lgatto commented 9 years ago

I have some issues with proteinCoverage. I find the implementation a bit confusion and can't figure out why I end up with NA.

> suppressPackageStartupMessages(library("Pbase"))
> data(p)
> p <- proteinCoverage(p)
> acols(p)$Coverage
    A4UGR9     A6H8Y1     O43707     O75369     P00558     P02545     P04075 
0.10758743 0.10518293 0.08232711         NA 0.16067146 0.18373494 0.59615385 
  P04075-2     P60709 
0.49043062         NA 

but O75369 looks fine to me

plot(p[4])

rplot001

A naive implementation gives the following, which matches proteinCoverage ommitting NAs

> prtl <- width(aa(p))
> pepl <- sapply(pranges(p), function(x) sum(width(reduce(x))))
> cvg <- pepl/prtl
> rbind(proteinCoverage = acols(p)$Coverage, cvg)
                   A4UGR9    A6H8Y1     O43707     O75369    P00558    P02545
proteinCoverage 0.1075874 0.1051829 0.08232711         NA 0.1606715 0.1837349
cvg             0.1075874 0.1051829 0.08232711 0.06379708 0.1606715 0.1837349
                   P04075  P04075-2     P60709
proteinCoverage 0.5961538 0.4904306         NA
cvg             0.5961538 0.4904306 0.03466667

@sgibb, could you clarify what happens and why?

sgibb commented 9 years ago

You are right, it is confusing and contains at least one bug.

The first lines check the arguments and ensure that the order of the pattern (peptides) is the same as in the subject (proteins; mabye we can skip the ordering step? or implement a general function?): ll. 19-41

The crucial step is in ll. 43-46. Here I check whether the peptides is within a protein (I can't remember why I have implemented this check. It should ensure that the peptides sequence is not longer than the protein sequence. Could this really happen? Maybe this is completley superfluous.) While this seems useless it reveals a bug in .splitIRanges in combination with the argument unshift = TRUE. unshift = TRUE assumes an ordered IRangesList (which is not the case in our example). I also can't remember why I implemented the unshift = TRUE option. By using unshift = TRUE we get some IRanges that start at one, e.g.:

ir <- IRanges(start = c(1, 6), end = c(5, 10))
ir
# IRanges of length 2
#     start end width
# [1]     1   5     5
# [2]     6  10     5
Pbase:::.splitIRanges(ir, f = c("a", "b"), unshift = FALSE)
# IRangesList of length 2
# $a
# IRanges of length 1
#     start end width
# [1]     1   5     5
#
# $b
# IRanges of length 1
#     start end width
# [1]     6  10     5

Pbase:::.splitIRanges(ir, f = c("a", "b"), unshift = TRUE)
# IRangesList of length 2
# $a
# IRanges of length 1
#     start end width
# [1]     1   5     5
#
# $b
# IRanges of length 1
#     start end width
# [1]     1   5     5

The last lines ll. 48-51 are exactly the same as the lines you suggested.

AFAIK we don't use the unshift option anywhere. If we avoid the check whether a peptide is in its corresponding protein we can remove the unshift option completely and make the code more readable. Otherwise I need to fix .splitIRanges. What do you think? I tend to remove the unshift argument.

lgatto commented 9 years ago

Unless I miss something:

Hence, my suggestion would be

proteinCoverage <- function(x) {
    stopifnot(is(x, "Proteins"))
    prtl <- width(aa(x))
    pepl <- sapply(pranges(x), function(xx) sum(width(reduce(xx))))
    cvg <- pepl/prtl
    addacol(x, "Coverage", cvg)
}

(which is by the way very slow - probably the sapply over the pranges(x))

sgibb commented 9 years ago

Unless I miss something:

  • we should have any issues with the order - this is guaranteed be the Proteins class: pfeature()[i] come from aa()[i];
  • peptide are expected to be in their respective protein. If there were not, then the Proteins object should not be valid.

Ok I will fix this.

lgatto commented 9 years ago

Just saw a typo - I meant