lgatto / Pbase

Manipluating and exploring protein and proteomics data
8 stars 3 forks source link

RFC: changing Proteins class #38

Open lgatto opened 7 years ago

lgatto commented 7 years ago

The issue

Currently, we don't do a good job supporting multiple types of pranges. We have on one side an example were pranges store peptides along proteins. With the new EnsDb access to protein data, one can easily add other information along the protein sequences, such as protein domain (see thePbase-with-ensembldb vignette). But, there is no good way to combine these different ranges into a single Proteins object.

Suggested solution

Make the current pranges (Compressed)IRangesList a column of mcols(x@aa). This then allows to easily add additional IRangesList (basically extending mcols(x@aa) or acols(x)).

Suggested implementation

setClass("Proteins2",
         contains = "Versioned",
         slots = list(metadata = "list",
                      aa = "AAStringSet"),
         prototype = prototype(
             new("Versioned",
                 versions = c(Proteins = "0.2"))))

.get_pranges_indices <-
    function(x) sapply(mcols(np@aa), inherits, "IRangesList")

pranges <- function(x) { ## always returns a DataFrame 
    i <- .get_pranges_indices(x)
    mcols(np@aa)[i]
}

acols <- function(x) {
    i <- .get_pranges_indices(x)
    mcols(np@aa)[!i]
}

avarLabels <- function(x) {
    i <- .get_pranges_indices(x)
    names(mcols(np@aa))[!i]
}

pvarLabels <- function(x) {
    i <- .get_pranges_indices(x)
    names(mcols(np@aa))[i]
}

setMethod("seqnames", "Proteins2",
          function(x) names(x@aa))

setMethod("length", "Proteins2",
          function(x) length(x@aa))

setMethod("show", "Proteins2",
          function (object) {
              topics <- c("S4 class type", "Class version", "Created", 
                          "Number of Proteins")
              topics <- format(topics, justify = "left")
              n <- length(object)
              values <- c(class(object),
                          tail(as(classVersion(object), "character"), 1L),
                          object@metadata$created, n)
              cat(paste0(topics, ": ", values, collapse = "\n"), sep = "\n")
              if (length(object) > 0) {
                  sn <- seqnames(object)
                  ln <- length(object)
                  cat("Sequences:\n  ")
                  Pbase:::htcat(seqnames(object), n = 2)
                  cat("Protein ranges:")
                  pvl <- pvarLabels(object)
                  if (length(pvl)) {
                      cat("\n  ")
                      Pbase:::htcat(pvarLabels(object), n = 2)
                  } else cat(" None\n")
              }
          })

Example

library("Pbase")
data(p)

.aa <- p@aa
mcols(.aa) <- DataFrame(mcols(.aa), pranges(p))
p2 <- new("Proteins2",
          metadata = metadata(p),
          aa = .aa)
> p2
S4 class type     : Proteins2
Class version     : 0.2
Created           : Sat Feb 14 12:36:40 2015
Number of Proteins: 9
Sequences:
  [1] A4UGR9 [2] A6H8Y1 ... [8] P04075-2 [9] P60709
Protein ranges:
  peptides
Error: could not find function "pragnes"
> pranges(p2)
DataFrame with 9 rows and 1 column
                                    peptides
                               <IRangesList>
1 [2743, 2760] [ 307,  318] [1858, 1870] ...
2 [ 448,  465] [1284, 1291] [1120, 1128] ...
3       [ 51,  65] [495, 512] [438, 450] ...
4 [ 895,  909] [1746, 1757] [2563, 2578] ...
5       [193, 206] [268, 275] [333, 350] ...
6       [  1,  11] [332, 349] [367, 377] ...
7       [112, 134] [112, 134] [154, 173] ...
8       [166, 188] [166, 188] [208, 227] ...
9                                 [184, 196]
> acols(p2)
DataFrame with 9 rows and 13 columns
     DB AccessionNumber   EntryName IsoformName
  <Rle>     <character> <character>       <Rle>
1    sp          A4UGR9 XIRP2_HUMAN          NA
2    sp          A6H8Y1  BDP1_HUMAN          NA
3    sp          O43707 ACTN4_HUMAN          NA
4    sp          O75369  FLNB_HUMAN          NA
5    sp          P00558  PGK1_HUMAN          NA
6    sp          P02545  LMNA_HUMAN          NA
                                        ProteinName OrganismName GeneName
                                        <character>        <Rle>    <Rle>
1     Xin actin-binding repeat-containing protein 2 Homo sapiens    XIRP2
2 Transcription factor TFIIIB component B'' homolog Homo sapiens     BDP1
3                                   Alpha-actinin-4 Homo sapiens    ACTN4
4                                         Filamin-B Homo sapiens     FLNB
5                         Phosphoglycerate kinase 1 Homo sapiens     PGK1
6                                      Prelamin-A/C Homo sapiens     LMNA
           ProteinExistence SequenceVersion Comment
                      <Rle>           <Rle>   <Rle>
1 Evidence at protein level               2      NA
2 Evidence at protein level               3      NA
3 Evidence at protein level               2      NA
4 Evidence at protein level               2      NA
5 Evidence at protein level               3      NA
6 Evidence at protein level               1      NA
                                 Filename     npeps            ENST
                                    <Rle> <integer>     <character>
1 ../extdata/HUMAN_2015_02_selected.fasta        36 ENST00000409195
2 ../extdata/HUMAN_2015_02_selected.fasta        23 ENST00000358731
3 ../extdata/HUMAN_2015_02_selected.fasta         6 ENST00000252699
4 ../extdata/HUMAN_2015_02_selected.fasta        13 ENST00000295956
5 ../extdata/HUMAN_2015_02_selected.fasta         5 ENST00000373316
6 ../extdata/HUMAN_2015_02_selected.fasta        12 ENST00000368300
 [ reached getOption("max.print") -- omitted 3 rows ]
> avarLabels(p2)
 [1] "DB"               "AccessionNumber"  "EntryName"        "IsoformName"     
 [5] "ProteinName"      "OrganismName"     "GeneName"         "ProteinExistence"
 [9] "SequenceVersion"  "Comment"          "Filename"         "npeps"           
[13] "ENST"            
> pvarLabels(p2)
[1] "peptides"

Expected new functionality

Adding ranges

> p2 <- addPrange(p2, domains = anIRangesList)
> p2
S4 class type     : Proteins2
Class version     : 0.2
Created           : Sat Feb 14 12:36:40 2015
Number of Proteins: 9
Sequences:
  [1] A4UGR9 [2] A6H8Y1 ... [8] P04075-2 [9] P60709
Protein ranges:
  [1] peptides [2] domains

Any comments/suggestions @ComputationalProteomicsUnit/developer-pbase

lgatto commented 7 years ago

Also, as a side-effect, I think that this new implementation is simpler and will lead to simplifications in the code.

sgibb commented 7 years ago

I agree that the current implementation is very limited. I really like your idea to support multiple pranges but I don't like using mcols(x@aa) for it. It would break the line-by-line relation of sequences and their metadata. Also the use of .get_pranges_indices seems error-prone. pranges is a central feature in Pbase that's why I think it earns an own slot and we could turn the pranges slot into an environment or a list. This would be similar to your suggestion but without the need for .get_pranges_indices.

BTW: Do we need an CPU R style guide? I personally don't like the Hadley's snake_case. In general we use the Bioc Coding Style, or?

sgibb commented 7 years ago

@lgatto what kind of code simplification do you mean? Regardless of an own pranges slot or as additional element in mcols(x@aa) we need to ensure the order and naming of the (Compressed)IRangesList. The only simplification I could think of would be simpler subsetting operators (because we don't have to subset x@aa and x@pranges but IMHO that is negligible).

lgatto commented 7 years ago

The code simplifications is that all is contained into a single DataFrame. This is a well-defined/written data structure maintained by Bioc-core. Having the AAStringSet's metadata columns and and the pranges as columns takes care of many of the validity implicitly and addition of new pranges. For example, adding a new pranges IRangesList will always match the number of row/proteins. And, as you say, subsetting becomes trivial. We still need to make sure names match, of course, but that's easy too:

stopifnot(all.equal(names(newPranges), names(object))
object@mcols <- cbind(object@mcols, DataFrame(newPranges))
## or, of more efficient
object@mcols[, "newPrangeName"] <- newPranges

I agree that pranges are important; it's the most important feature of the class/package. But I don't think this needs to be reflected in the implementation. Having an easy way (code) to discriminate the pranges columns and aa's metadata exposes this importance.

I think that the simple .get_pranges_indices is actually quite good. It's simple (that's very good) and I believe the class of the column is a good way to define it it's a valid prange - could you think of a case where a sequence metadata would be a IRangesList other than describing ranges for each sequence? This is the definition of a prange. It would also be possible to have out own PRangesList class that is a re-branded IRangesList to avoid to rely on the above assumption.

I don't understand this

It would break the line-by-line relation of sequences and their metadata.

on the contrary, it enforces it by design for the sequence metadata and pranges. Each pranges have their own metadata, of course.

Also, I don't agree with the suggestion of having them in a list or environment. A DataFrame is the prefect data structure for this. We could have 2 DataFrames, with the same row names, but then, why not combine them?

sgibb commented 7 years ago

I understand your point and fully agree that using robust and heavily tested bioc classes is better than inventing our own data structure. In my understanding a (Annotated)DataFrame is a table like structure and each row correspond to a single sequence (that was my intention of line-by-line relation). Adding a (Compressed)IRangesList as a DataFrame column would create a row (the corresponding IRangesList element) with multiple entries (IRanges). But in fact a data.frame is just a named list so here is no real point to argue about. I am fine with your suggestion.

lgatto commented 7 years ago

I am confused when you say

Adding a (Compressed)IRangesList as a DataFrame column would create a row (the corresponding IRangesList element) with multiple entries (IRanges).

Here is how I see it: adding an IRangesList to the DataFrame corresponds to adding a columns (an item to the list, as you point out, with the right length), which, from our point of view, adds an IRanges to each row/protein. This exactly match your line-by-line relation with each row corresponding to a single sequence.

proteins mcols IRangesList IRangesList
Sequence_1 .... IRranges_1 IRranges_1
Sequence_2 .... IRranges_2 IRranges_2
Sequence_n .... IRranges_n IRranges_n
sgibb commented 7 years ago

Sorry for causing confusion, I meant:

proteins mcols IRangesList IRangesList
Sequence_1 .... IRranges_1 IRranges_1
- Range1 1 - Range1 1
- Range1 2 - Range1 2
- Range1 n - Range1 n
Sequence_2 .... IRranges_2 IRranges_2
Sequence_n .... IRranges_n IRranges_n

But that is not important. It was just my own confusion about the representation of a tree-like list structure and a table-like data.frame.

After thinking twice I totally agree with your approach.

jorainer commented 7 years ago

good that you did all the discussion already :) I like the idea very much of putting the IRangesList into the mcols DataFrame - no more checking that the mapping protein to ranges fits. Did you check the code already in? Did you test adding additional pranges?

lgatto commented 7 years ago

Going to push soon. Before adding code to add pranges, I'll make sure that the old code base works.

lgatto commented 7 years ago

I will also document here other updates related to the new Proteins implementation:

See also #41 for more details about the API

jorainer commented 7 years ago

As an info: since https://github.com/ComputationalProteomicsUnit/Pbase/commit/12749cdfae727e9000294fb20b82a46323c1ae6c the Proteins,EnsDb method fetches proteins and puts the protein domains into the mcols(aa(x)). Works nicely. Like that.