Closed lgatto closed 2 years ago
Note: using sparse matrices (and more generally Matrix objects) leads to much faster code.
I can see two options for now:
Keep the class as it is (i.e. a DFrame) and add the matrices in the metadata slot. This has the advantage that the current code can stay as it is. Conceptually, if the class is called PSM, the adjacency and ~related~ connected component matrices can indeed be considered metadata to the PSM table. A validity method could check that these metadata elements are indeed of the correct class (i.e. inherit from Matrix) and possibly have the correct rows/cols (see below). We would have setters and getters to manage these metadata elements (as we have now for the PSM variables and reduced element).
Alternatively, we could update the class (or create a new class, possibly called PSMAnalysis?) that would have as slots the two matrices and the (current) PSM object (and possibly @reduced and @[psm]variables that are currently stored as metadata). This would require some non-trivial adaptation of the current code and would also require implementing the data.frame interface (names, dim, nrow, ncol, [, $) and coercion to DataFrame and data.frame.
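To give an idea of what the second option involves, here is a minimal sketch of delegating part of the data.frame interface to a slot. It makes several assumptions: the class name PSMAnalysisSketch is hypothetical, a plain data.frame stands in for the PSM object, and only dim, names, $ and coercion to data.frame are covered ([ and coercion to DataFrame are left out).

```r
library(Matrix)

## Hypothetical class: a plain data.frame stands in for the PSM object,
## and any object inheriting from Matrix is accepted as adjacency matrix.
setClass("PSMAnalysisSketch",
         slots = c(psmTable = "data.frame",
                   adjacencyMatrix = "Matrix"))

## Delegate the table-like interface to the psmTable slot.
## nrow() and ncol() come for free, as they call dim().
setMethod("dim", "PSMAnalysisSketch", function(x) dim(x@psmTable))
setMethod("names", "PSMAnalysisSketch", function(x) names(x@psmTable))
setMethod("$", "PSMAnalysisSketch", function(x, name) x@psmTable[[name]])

## Coercion back to a data.frame
setAs("PSMAnalysisSketch", "data.frame", function(from) from@psmTable)

## Toy data (illustration only)
tab <- data.frame(peptide = c("pep1", "pep2"),
                  protein = c("ProtA", "ProtB"),
                  stringsAsFactors = FALSE)
adj <- sparseMatrix(i = 1:2, j = 1:2, x = c(1, 1),
                    dimnames = list(tab$peptide, tab$protein))

x <- new("PSMAnalysisSketch", psmTable = tab, adjacencyMatrix = adj)
dim(x)
names(x)
x$protein
as(x, "data.frame")
```

Even in this reduced form, every accessor has to be written and maintained by hand, which is the cost the first option avoids.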
Whatever the option, the adjacency and ~related~ connected component matrices shouldn't be computed immediately when creating the PSM object, as additional filtering is likely to be done (when created from mzid files). In addition, a validity check should probably be added once these matrices are created, verifying that the peptides and proteins in the PSMs and matrices are identical: would it make sense to have these out of sync, by filtering the PSMs after the creation of the matrices? Or would we need to update the matrices accordingly?
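The kind of check discussed here could look something like the sketch below. It is only an illustration under assumptions: checkAdjacencyInSync() is a hypothetical helper, a plain data.frame stands in for the PSM object, and the adjacency matrix is built by hand with Matrix::sparseMatrix() rather than with makeAdjacencyMatrix().

```r
library(Matrix)

## Hypothetical helper: returns TRUE when the peptides (rows) and proteins
## (columns) of `adj` match those in the PSM table, and a character vector
## describing the mismatches otherwise (the convention of S4 validity methods).
checkAdjacencyInSync <- function(psmdf, adj,
                                 peptide = "peptide", protein = "protein") {
    stopifnot(is.data.frame(psmdf), methods::is(adj, "Matrix"))
    msg <- character()
    if (!setequal(rownames(adj), psmdf[[peptide]]))
        msg <- c(msg, "peptides in the PSM table and matrix rows differ")
    if (!setequal(colnames(adj), psmdf[[protein]]))
        msg <- c(msg, "proteins in the PSM table and matrix columns differ")
    if (length(msg)) msg else TRUE
}

psmdf <- data.frame(psm = paste0("psm", 1:10),
                    peptide = paste0("pep", c(1, 1, 2, 2, 3, 4, 6, 7, 8, 8)),
                    protein = paste0("Prot", LETTERS[c(1, 1, 2, 2, 3, 4, 3, 5, 6, 6)]))

## Sparse adjacency matrix built directly from the table
pep <- factor(psmdf$peptide)
prot <- factor(psmdf$protein)
adj <- sparseMatrix(i = as.integer(pep), j = as.integer(prot),
                    x = rep(1, nrow(psmdf)),
                    dims = c(nlevels(pep), nlevels(prot)),
                    dimnames = list(levels(pep), levels(prot)))

## In sync right after creation
inSync <- checkAdjacencyInSync(psmdf, adj)

## Out of sync once the PSMs are filtered after the matrix was created
filtered <- psmdf[psmdf$protein != "ProtF", ]
outOfSync <- checkAdjacencyInSync(filtered, adj)
```

Whether such a mismatch should be an error, or should trigger an update of the matrices, is exactly the open question above; the sketch only detects it.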
Another point to consider is that we might want to use multiple adjacency matrices. This example, taken from the most recent makeAdjacencyMatrix() man page, illustrates the case with a first matrix generated from the PSMs (with multiple PSMs for a peptide) and the same result after reduction (or after setting the binary argument):
psmdf <- data.frame(psm = paste0("psm", 1:10),
                    peptide = paste0("pep", c(1, 1, 2, 2, 3, 4, 6, 7, 8, 8)),
                    protein = paste0("Prot", LETTERS[c(1, 1, 2, 2, 3, 4, 3, 5, 6, 6)]))
psmdf
#>      psm peptide protein
#> 1   psm1    pep1   ProtA
#> 2   psm2    pep1   ProtA
#> 3   psm3    pep2   ProtB
#> 4   psm4    pep2   ProtB
#> 5   psm5    pep3   ProtC
#> 6   psm6    pep4   ProtD
#> 7   psm7    pep6   ProtC
#> 8   psm8    pep7   ProtE
#> 9   psm9    pep8   ProtF
#> 10 psm10    pep8   ProtF
psm <- PSM(psmdf, peptide = "peptide", protein = "protein")
psm
#> PSM with 10 rows and 3 columns.
#> names(3): psm peptide protein
makeAdjacencyMatrix(psm)
#> 7 x 6 sparse Matrix of class "dgCMatrix"
#>      ProtA ProtB ProtC ProtD ProtE ProtF
#> pep1     2     .     .     .     .     .
#> pep2     .     2     .     .     .     .
#> pep3     .     .     1     .     .     .
#> pep4     .     .     .     1     .     .
#> pep6     .     .     1     .     .     .
#> pep7     .     .     .     .     1     .
#> pep8     .     .     .     .     .     2
## Reduce PSM object to peptides
rpsm <- reducePSMs(psm, k = psm$peptide)
rpsm
#> Reduced PSM with 7 rows and 3 columns.
#> names(3): psm peptide protein
makeAdjacencyMatrix(rpsm)
#> 7 x 6 sparse Matrix of class "dgCMatrix"
#>      ProtA ProtB ProtC ProtD ProtE ProtF
#> pep1     1     .     .     .     .     .
#> pep2     .     1     .     .     .     .
#> pep3     .     .     1     .     .     .
#> pep4     .     .     .     1     .     .
#> pep6     .     .     1     .     .     .
#> pep7     .     .     .     .     1     .
#> pep8     .     .     .     .     .     1
## Or set binary to TRUE
makeAdjacencyMatrix(psm, binary = TRUE)
#> 7 x 6 sparse Matrix of class "dgCMatrix"
#>      ProtA ProtB ProtC ProtD ProtE ProtF
#> pep1     1     .     .     .     .     .
#> pep2     .     1     .     .     .     .
#> pep3     .     .     1     .     .     .
#> pep4     .     .     .     1     .     .
#> pep6     .     .     1     .     .     .
#> pep7     .     .     .     .     1     .
#> pep8     .     .     .     .     .     1
Another case leading to multiple adjacency matrices would be multiple PSM filtering steps. As mentioned in the message above, this could lead to inconsistencies between the PSM table and the matrices.
One extension could involve a list of adjacency matrices (and thus also of connected component matrices), so that all matrices of interest could be stored. Something along the lines of
setClass("PSMAnalysis",
         slots = c(adjacencyList = "List",
                   connectedComponentList = "List",
                   psmTable = "PSM"))
Closing for now: the ConnectedComponent class is ready.
[x] Based on a discussion with @cvanderaa @thomasburger and @samWieczorek, the PSM class is going to be extended to include slots to store the adjacency and the ~related~ connected component matrices.
[x] If we want to store these as sparse matrices, then makeAdjacencyMatrix() should first be modified to compute a sparse adjacency matrix.
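For reference, one way a sparse adjacency matrix could be computed from the peptide and protein columns is sketched below. This is not the package's implementation: makeSparseAdjacency() is a hypothetical helper, and it relies on Matrix::sparseMatrix() summing the values of duplicated (row, column) pairs, so that several PSMs for the same peptide/protein pair are counted.

```r
library(Matrix)

## Hypothetical helper building a peptide-by-protein sparse adjacency matrix.
## With binary = FALSE, entries count the PSMs per peptide/protein pair;
## with binary = TRUE, duplicated pairs are dropped first so entries are 0/1.
makeSparseAdjacency <- function(peptides, proteins, binary = FALSE) {
    pep <- factor(peptides)
    prot <- factor(proteins)
    idx <- data.frame(i = as.integer(pep), j = as.integer(prot))
    if (binary)
        idx <- unique(idx)
    sparseMatrix(i = idx$i, j = idx$j,
                 x = rep(1, nrow(idx)),
                 dims = c(nlevels(pep), nlevels(prot)),
                 dimnames = list(levels(pep), levels(prot)))
}

peptides <- paste0("pep", c(1, 1, 2, 2, 3, 4, 6, 7, 8, 8))
proteins <- paste0("Prot", LETTERS[c(1, 1, 2, 2, 3, 4, 3, 5, 6, 6)])

adj <- makeSparseAdjacency(peptides, proteins)
adjBinary <- makeSparseAdjacency(peptides, proteins, binary = TRUE)
```

With the example data used earlier in this thread, adj holds the PSM counts (e.g. 2 for pep1/ProtA) and adjBinary holds only ones, without ever allocating a dense matrix.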