pcols accessor and replacement method

lgatto commented 8 years ago

@sgibb - what do you think of the following?

## Convenience accessors and replacement method
setMethod("$", "Proteins",
          function(x, name) {
              eval(substitute(x@pranges@unlistData@elementMetadata$NAME_ARG,
                              list(NAME_ARG=name)))
          })

setReplaceMethod("$", "Proteins", function(x, name, value) {
    x@pranges@unlistData@elementMetadata[[name]] <- value
    x
})

Usage:

> data(p)
> p$FOO <- 1:137
> p$FOO
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
 [ reached getOption("max.print") -- omitted 37 entries ]
> pcols(p)
SplitDataFrameList of length 9
$A4UGR9
DataFrame with 36 rows and 29 columns
       DB AccessionNumber   EntryName IsoformName
    <Rle>     <character> <character>       <Rle>
1      sp          A4UGR9 XIRP2_HUMAN          NA
2      sp          A4UGR9 XIRP2_HUMAN          NA
                                                            ProteinName
                                                            <character>
1   sp|A4UGR9|XIRP2_HUMAN Xin actin-binding repeat-containing protein 2
2   sp|A4UGR9|XIRP2_HUMAN Xin actin-binding repeat-containing protein 2
    OrganismName GeneName          ProteinExistence SequenceVersion Comment
           <Rle>    <Rle>                     <Rle>           <Rle>   <Rle>
1   Homo sapiens    XIRP2 Evidence at protein level               2      NA
2   Homo sapiens    XIRP2 Evidence at protein level               2      NA
    spectrumID chargeState      rank passThreshold experimentalMassToCharge
      <factor>   <integer> <integer>     <logical>                <numeric>
1    index=124           3         1          TRUE                 715.0305
2     index=28           2         1          TRUE                 715.9177
    calculatedMassToCharge           sequence    modNum   isDecoy     post
                 <numeric>           <factor> <integer> <logical> <factor>
1                 715.0308 QEITQNKSFFSSVKESQR         0     FALSE        D
2                 715.4117       LPVPKDVYSKQR         0     FALSE        N
         pre     start       end        DatabaseAccess DBseqLength DatabaseSeq
    <factor> <integer> <integer>              <factor>   <integer>    <factor>
1          K      2743      2760 sp|A4UGR9|XIRP2_HUMAN        3374            
2          R       307       318 sp|A4UGR9|XIRP2_HUMAN        3374            
    acquisitionNum                      filenames       FOO
         <numeric>                          <Rle> <integer>
1              124 Thermo_Hela_PRTC_selected.mzid         1
2               28 Thermo_Hela_PRTC_selected.mzid         2
 [ reached getOption("max.print") -- omitted 9 rows ]

...
<8 more elements>

I am unsure if this is best practice for CompressedSplitDataFrameList, though.

This is in relation to this question.
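
A minimal sketch of the same accessor pattern on a toy S4 class, for readers without Pbase at hand. ToyProteins and its meta slot are hypothetical stand-ins for Proteins and its element metadata. The accessor uses [[ (exact matching) instead of the eval(substitute(...)) form above; that choice is an assumption for the sketch, not the proposed implementation.

```r
library(methods)

## Hypothetical toy class: a single data.frame stands in for the element metadata.
setClass("ToyProteins", representation(meta = "data.frame"))

## Accessor: `name` arrives as a character string, so [[ can be used directly.
setMethod("$", "ToyProteins", function(x, name) x@meta[[name]])

## Replacement: add or overwrite a metadata column and return the modified object.
setReplaceMethod("$", "ToyProteins", function(x, name, value) {
    x@meta[[name]] <- value
    x
})

tp <- new("ToyProteins", meta = data.frame(start = c(1L, 5L), end = c(4L, 9L)))
tp$width <- tp$end - tp$start + 1L
tp$width
```

Because [[ matches names exactly, this variant would not pick up abbreviated column names.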

sgibb commented 8 years ago

Mh, I am not convinced. E.g. in a data.frame both [ and $ access a column (the latter also "supports" partial matching, as in pmatch):

d <- data.frame(filename=letters[1:3], value=1:3)
d["filename"]
d$filename

In a Proteins object [ creates a subset and $ would access/modify metadata. Wouldn't this be counterintuitive?
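
The data.frame behaviour referred to here can be checked in base R. This sketch only illustrates standard data.frame semantics; the abbreviated name file is chosen purely for the demonstration.

```r
d <- data.frame(filename = letters[1:3], value = 1:3)

by_dollar  <- d$file                      # $ falls back to partial matching
by_exact   <- d[["file"]]                 # [[ matches exactly by default: no such column
by_partial <- d[["file", exact = FALSE]]  # partial matching only on request
by_bracket <- d["filename"]               # [ returns a one-column data.frame

is.null(by_exact)
is.data.frame(by_bracket)
```

So even within data.frame the two operators differ in matching rules and return type, but both address columns.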

For the mentioned question I would suggest adding an addIdentificationData method for mzID objects or a data.frame (as we have done in MSnbase).

lgatto commented 8 years ago

In a Proteins object [ creates a subset and $ would access/modify metadata. Wouldn't this be counterintuitive?

Yes, I suppose so. The idea stems from pcols(x) being a CompressedSplitDataFrameList.

lgatto commented 8 years ago

For the mentioned question I would suggest adding an addIdentificationData method for mzID objects or a data.frame (as we have done in MSnbase).

I can't remember if the MSnSet output of MSnID (let's call it x) has all the metadata that was originally in the mzID files. We could add identification data from fData(x), but not sure if ideal.

Now, by adding all the mzID files, OP has a Proteins object with all the peptides. My idea was to subset the pranges(p) with something like:

p$delta <- p$experimentalMassToCharge - p$calculatedMassToCharge
sel <- abs(p$delta) < 0.35
pranges(p) <- pranges(p)[sel]

Well, the last line is just an illustration, of course, just to give you an idea. Maybe we need a dedicated function for that, possibly with non-standard evaluation:

p2 <- subset(p, abs(delta) < 0.35)
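
A rough sketch of how such a non-standard-evaluation filter could work, using a plain list instead of a Proteins object. The toy object, its ranges and meta elements, and the function name filter_peptides are all hypothetical; nothing here is Pbase API.

```r
## Hypothetical stand-in for a Proteins object: one metadata row per peptide.
toy <- list(
    ranges = data.frame(start = c(10L, 40L, 90L), end = c(20L, 55L, 99L)),
    meta   = data.frame(experimentalMassToCharge = c(715.03, 716.40, 500.10),
                        calculatedMassToCharge   = c(715.00, 715.40, 500.00))
)

## filter_peptides() is a made-up name for illustration only.
filter_peptides <- function(x, cond) {
    expr <- substitute(cond)                    # capture the condition unevaluated
    keep <- eval(expr, x$meta, parent.frame())  # resolve names in the metadata first
    stopifnot(is.logical(keep), length(keep) == nrow(x$meta))
    keep <- keep & !is.na(keep)                 # treat NA as "do not keep"
    x$ranges <- x$ranges[keep, , drop = FALSE]
    x$meta   <- x$meta[keep, , drop = FALSE]
    x
}

toy$meta$delta <- toy$meta$experimentalMassToCharge - toy$meta$calculatedMassToCharge
toy2 <- filter_peptides(toy, abs(delta) < 0.35)
nrow(toy2$ranges)
```

Evaluating in the metadata with parent.frame() as the enclosure mirrors what base subset() does for data.frames, with the usual caveat that such functions are awkward to call programmatically.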

sgibb commented 8 years ago

Ok, now I understand what you (or the OP) want to achieve. I don't think we need any new subsetting here. The subset operator [ of DataFrameList supports a special matrix-like subsetting syntax ([i, j]; see ?DataFrameList for details): by setting i to missing you can loop over the list (which is not possible with classic R lists):

library("Pbase")
data(p)

## stupid example to determine the length of a sequence (end - start, i.e. width - 1)
delta <- pcols(p)[, "end"] - pcols(p)[, "start"]
# IntegerList of length 9
# [["A4UGR9"]] 17 11 12 9 15 12 12 9 18 15 12 8 ... 12 14 15 22 13 12 11 17 13 12 13
# [["A6H8Y1"]] 17 7 8 17 7 17 16 11 16 17 10 20 12 20 25 9 9 16 16 19 9 10 26
# [["O43707"]] 14 17 12 12 15 11
# [["O75369"]] 14 11 15 13 8 10 10 13 23 20 10 12 14
# [["P00558"]] 13 7 17 15 10
# [["P02545"]] 10 17 10 11 15 16 19 16 13 17 16 17
# [["P04075"]] 22 22 19 17 26 12 26 10 27 8 29 7 14 13 21 6 19 26 11 26 21
# [["P04075-2"]] 22 22 19 17 26 12 26 10 27 8 29 7 14 13 21 6 19 26 26 21
# [["P60709"]] 12

## subset by length
pranges(p)[delta > 20]
# IRangesList of length 9
# $A4UGR9
# IRanges of length 1
#     start  end width  names
# [1]  2710 2732    23 A4UGR9
#
# $A6H8Y1
# IRanges of length 2
#     start end width  names
# [1]    21  46    26 A6H8Y1
# [2]     5  31    27 A6H8Y1
#
# $O43707
# IRanges of length 0
#
# ...
# <6 more elements>

The OP's example should be possible with the following code snippet:

delta <- pcols(p)[, "experimentalMassToCharge"] -
         pcols(p)[, "calculatedMassToCharge"]
sel <- abs(delta) < 0.35
pranges(p)[sel]

Using pcols(x)[, "foo"] is not as easy to type as x$foo but would not be confused with classical data.frame subsetting and would be similar to MSnbase's fData(x)["foo"].

But to do what the OP wants, we need a pranges<- replacement method (and we could add one for pcols, too).
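
For comparison with the [, j] syntax above, here is a sketch of the same per-protein filter written against classic R lists, where the looping has to be spelled out. The lists pc and pr and their contents are invented stand-ins for pcols(p) and pranges(p), not real Pbase data.

```r
## Hypothetical per-protein metadata as a classic list of data.frames.
pc <- list(
    P1 = data.frame(experimentalMassToCharge = c(715.03, 716.40),
                    calculatedMassToCharge   = c(715.00, 715.40)),
    P2 = data.frame(experimentalMassToCharge = 500.10,
                    calculatedMassToCharge   = 500.00)
)
## Hypothetical peptide ranges: one data.frame per protein, same names and order.
pr <- list(
    P1 = data.frame(start = c(10L, 40L), end = c(20L, 55L)),
    P2 = data.frame(start = 90L, end = 99L)
)

## A plain list has no [, "column"] shortcut, so loop over the proteins explicitly.
delta <- lapply(pc, function(d) d$experimentalMassToCharge - d$calculatedMassToCharge)
sel   <- lapply(delta, function(x) abs(x) < 0.35)

## Pair each protein's ranges with its logical selector and subset row-wise.
filtered <- Map(function(r, s) r[s, , drop = FALSE], pr, sel)
kept <- vapply(filtered, nrow, integer(1))
kept
```

The list-like classes from IRanges/S4Vectors collapse the lapply/Map steps into vectorised operations, which is what makes the shorter snippet above possible.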

lgatto commented 8 years ago

Using pcols(x)[, "foo"] is not as easy to type as x$foo but would not be confused with classical data.frame subsetting and would be similar to MSnbase's fData(x)["foo"].

I always use fData(x)$foo, but let's forget about the $ for now.

Yes, we only need a pranges<- replacement method for this. Not sure about pcols - it is perhaps confusing to filter on the element metadata in order to filter the actual peptide ranges.

Re OP, my idea is that if we have general sub-setting capabilities, that question and many others can be resolved easily.

lgatto commented 8 years ago

Yes, we only need a pranges<- replacement method for this. Not sure about pcols - it is perhaps confusing to filter on the element metadata in order to filter the actual peptide ranges.

Maybe pfeatures<-?

lgatto commented 8 years ago

@sgibb - could you have a look at commit 7583f240e90642092937ef72a6a8bfebd235914c.

lgatto commented 8 years ago

I think it would be nice to have pcols(x)[, "FOO"] <- x, to record delta for example. And I guess something like acols(x)[, "BAR"] <- ... would also be nice for consistency. What do you think?

sgibb commented 8 years ago

https://github.com/ComputationalProteomicsUnit/Pbase/commit/7583f240e90642092937ef72a6a8bfebd235914c looks fine but we have to ensure that names(pranges) == names(aa).

acols<- would be a good idea, too.
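
The name check mentioned above could look roughly like this in a replacement function. This is a sketch on a plain list; the toy object and the function name set_ranges<- are hypothetical and do not reflect how the referenced commit implements pranges<-.

```r
## Hypothetical toy container: `aa` holds sequences, `ranges` per-protein features.
toy <- list(aa = c(P1 = "MKT", P2 = "GAV"),
            ranges = list(P1 = 1:2, P2 = 3L))

## `set_ranges<-` is a made-up name illustrating the check only.
`set_ranges<-` <- function(x, value) {
    ## Refuse replacements whose names (or their order) differ from the sequences.
    if (!identical(names(value), names(x$aa)))
        stop("names of the replacement ranges must match names of the sequences")
    x$ranges <- value
    x
}

set_ranges(toy) <- list(P1 = 1L, P2 = 3L)   # names match: accepted

## A replacement that drops P2 is rejected and leaves `toy` unchanged.
res <- tryCatch({ set_ranges(toy) <- list(P1 = 1L); "accepted" },
                error = function(e) "rejected")
res
```

In an S4 class the same condition would more naturally live in the validity method, so that every modification path is covered.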

lgatto commented 8 years ago

I have added acols<-. Still need to write tests, though.

sgibb commented 8 years ago

Closed in bf42895e918328c3c11708852d8a47a9a00fb6ba and 7a2375d2037704535e4c4d435cd66a0fbd250a0d.

lgatto commented 8 years ago

Thanks!

lgatto / Pbase

pcols accessor and replacement method #25