Extending PSM class - Githubissues

lgatto commented 3 years ago

[x] Based on a discussion with @cvanderaa @thomasburger and @samWieczorek, the PSM class is going to be extended to include slots to store the adjacency and the ~related~ connected component matrices.
[X] If we want to store these as sparse matrices, then makeAdjacencyMatrix() should first be modified to compute a sparse adjacency matrix.

lgatto commented 3 years ago

Note: using sparse matrices (and more generally Matrix objects) leads to much faster code.
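
For illustration, here is a minimal sketch of how a sparse peptide-by-protein adjacency matrix could be built directly from two character vectors with Matrix::sparseMatrix(). The toy vectors and variable names are assumptions made for this example; this is not the actual makeAdjacencyMatrix() implementation.

```r
library(Matrix)

## Toy PSM-level annotation (hypothetical data): one entry per PSM
peptide <- c("pep1", "pep1", "pep2", "pep3")
protein <- c("ProtA", "ProtA", "ProtB", "ProtB")

## Factors give the row/column indices and the dimnames
pep <- factor(peptide)
prot <- factor(protein)

## sparseMatrix() sums the x values of repeated (i, j) pairs by default,
## so each cell counts the PSMs supporting a peptide/protein pair
adj <- sparseMatrix(i = as.integer(pep),
                    j = as.integer(prot),
                    x = 1,
                    dims = c(nlevels(pep), nlevels(prot)),
                    dimnames = list(levels(pep), levels(prot)))
adj
```

Only the non-zero cells are stored, which is what makes this representation attractive for large identification results.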

lgatto commented 3 years ago

I can see two options for now:

Keep the class as it is (i.e. a DFrame) and add the matrices in the metadata slot. This has the advantage that the current code can stay as it is. Conceptually, if the class is called PSM, the adjacency and ~related~ connected component matrices can indeed be considered metadata to the PSM table. A validity method could check that these metadata elements are indeed of the correct class (i.e. inherit from Matrix) and possibly have the correct rows/cols (see below). We would have setters and getters to manage these metadata elements (as we have now for the PSM variables and reduced element).
Alternatively, we could update the class (or create a new class, possibly called PSMAnalysis?) that would have as slots the two matrices and the (current) PSM object (and possibly @reduced and @[psm]variables that are currently stored as metadata). This would require some non-trivial adaptation of the current code and would also require implementing the data.frame interface (names, dim, nrow, ncol, [, $) and coercion to DataFrame and data.frame.
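
To make the first option more concrete, here is a rough sketch using a plain S4Vectors DataFrame as a stand-in for the PSM table. The helper names setAdjacency() and getAdjacency() are hypothetical and not part of the package; only DataFrame(), metadata() and sparseMatrix() are existing functions.

```r
library(S4Vectors)
library(Matrix)

## Stand-in for a PSM table (a plain DataFrame, not a real PSM object)
psmTable <- DataFrame(psm = paste0("psm", 1:3),
                      peptide = c("pep1", "pep1", "pep2"),
                      protein = c("ProtA", "ProtA", "ProtB"))

## Toy adjacency matrix matching the table above
adj <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(2, 1),
                    dimnames = list(c("pep1", "pep2"),
                                    c("ProtA", "ProtB")))

## Hypothetical setter: refuses anything that does not inherit from Matrix
setAdjacency <- function(x, value) {
    stopifnot(is(value, "Matrix"))
    metadata(x)$adjacency <- value
    x
}

## Hypothetical getter
getAdjacency <- function(x) metadata(x)$adjacency

psmTable <- setAdjacency(psmTable, adj)
getAdjacency(psmTable)
```

The table itself is untouched (same rows and columns), which is why existing code operating on the DFrame would keep working under this option.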

Whatever the option, the adjacency and ~related~ connected component matrices shouldn't be computed immediately when creating the PSM object, as additional filtering is likely to be done (when created from mzid files). In addition, some validity check should probably be added once these matrices are created, to verify that the peptides and proteins in the PSMs and in the matrices are identical: would it make sense to have these out of sync, by filtering the PSMs after the creation of the matrices? Or would we need to update the matrices accordingly?
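
A sketch of what such a check could look like, assuming the peptides and proteins are available as character vectors. isInSync() is a hypothetical helper written for this example, and the data are made up.

```r
library(Matrix)

## Hypothetical check: do the matrix dimnames match the PSM annotation?
isInSync <- function(peptide, protein, adj) {
    setequal(unique(peptide), rownames(adj)) &&
        setequal(unique(protein), colnames(adj))
}

peptide <- c("pep1", "pep1", "pep2", "pep3")
protein <- c("ProtA", "ProtA", "ProtB", "ProtC")
adj <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3), x = c(2, 1, 1),
                    dimnames = list(c("pep1", "pep2", "pep3"),
                                    c("ProtA", "ProtB", "ProtC")))

## In sync right after the matrix is created
before <- isInSync(peptide, protein, adj)

## Filtering the PSMs afterwards breaks the correspondence
keep <- peptide != "pep3"
after <- isInSync(peptide[keep], protein[keep], adj)

## One way to restore it: subset the matrix to what is left
adj2 <- adj[unique(peptide[keep]), unique(protein[keep]), drop = FALSE]
updated <- isInSync(peptide[keep], protein[keep], adj2)
```

Subsetting only handles the removal of whole peptides or proteins. If the matrix holds PSM counts, filtering individual PSMs would also change cell values, so recomputing the matrix may be the safer choice.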

lgatto commented 2 years ago

Another point to consider is that we might want to use multiple adjacency matrices. This example, taken from the most recent makeAdjacencyMatrix() man page, illustrates the case with a first matrix generated from the PSMs (with multiple PSMs per peptide) and the same result after reduction (or after setting the binary argument):

psmdf <- data.frame(psm = paste0("psm", 1:10),
                    peptide = paste0("pep", c(1, 1, 2, 2, 3, 4, 6, 7, 8, 8)),
                    protein = paste0("Prot", LETTERS[c(1, 1, 2, 2, 3, 4, 3, 5, 6, 6)]))
psmdf
#>      psm peptide protein
#> 1   psm1    pep1   ProtA
#> 2   psm2    pep1   ProtA
#> 3   psm3    pep2   ProtB
#> 4   psm4    pep2   ProtB
#> 5   psm5    pep3   ProtC
#> 6   psm6    pep4   ProtD
#> 7   psm7    pep6   ProtC
#> 8   psm8    pep7   ProtE
#> 9   psm9    pep8   ProtF
#> 10 psm10    pep8   ProtF
psm <- PSM(psmdf, peptide = "peptide", protein = "protein")
psm
#> PSM with 10 rows and 3 columns.
#> names(3): psm peptide protein
makeAdjacencyMatrix(psm)
#> 7 x 6 sparse Matrix of class "dgCMatrix"
#>      ProtA ProtB ProtC ProtD ProtE ProtF
#> pep1     2     .     .     .     .     .
#> pep2     .     2     .     .     .     .
#> pep3     .     .     1     .     .     .
#> pep4     .     .     .     1     .     .
#> pep6     .     .     1     .     .     .
#> pep7     .     .     .     .     1     .
#> pep8     .     .     .     .     .     2

## Reduce PSM object to peptides
rpsm <- reducePSMs(psm, k = psm$peptide)
rpsm
#> Reduced PSM with 7 rows and 3 columns.
#> names(3): psm peptide protein
makeAdjacencyMatrix(rpsm)
#> 7 x 6 sparse Matrix of class "dgCMatrix"
#>      ProtA ProtB ProtC ProtD ProtE ProtF
#> pep1     1     .     .     .     .     .
#> pep2     .     1     .     .     .     .
#> pep3     .     .     1     .     .     .
#> pep4     .     .     .     1     .     .
#> pep6     .     .     1     .     .     .
#> pep7     .     .     .     .     1     .
#> pep8     .     .     .     .     .     1

## Or set binary to TRUE
makeAdjacencyMatrix(psm, binary = TRUE)
#> 7 x 6 sparse Matrix of class "dgCMatrix"
#>      ProtA ProtB ProtC ProtD ProtE ProtF
#> pep1     1     .     .     .     .     .
#> pep2     .     1     .     .     .     .
#> pep3     .     .     1     .     .     .
#> pep4     .     .     .     1     .     .
#> pep6     .     .     1     .     .     .
#> pep7     .     .     .     .     1     .
#> pep8     .     .     .     .     .     1

Another example leading to multiple adjacency matrices would be following multiple PSM filtering steps, which is mentioned in the message above and would lead to possible inconsistencies between the PSM table and the matrices.

One extension could involve a list of adjacency matrices (and thus also of connected component matrices), so that all matrices of interest could be stored. Something along the lines of

## "List" is the S4Vectors class; "PSM" is the current PSM class
setClass("PSMAnalysis",
         slots = c(adjacencyList = "List",
                   connectedComponentList = "List",
                   psmTable = "PSM"))

lgatto commented 2 years ago

Closing for now:

adjacency matrices can be created from various inputs
ConnectedComponent class is ready
No need or desire (for now) to put everything in a single class - this would in any case complicate the whole code base, as different components would need to be kept in sync (for example, updating the adjacency matrix and CCs every time the PSMs are filtered)

rformassspectrometry / PSMatch

Extending PSM class #8