Closed cvanderaa closed 8 months ago
names = "psms" is not a good idea (eg "precursor" is more relevant to DIA), I suggest to use "quants" instead, but I'm open to discussion.
I dropped automatic naming of the rownames as
paste0("PSM", seq_len(nrow(assayData))
, this was ported from readSCP() and brings more noise (cf point above) than benefits.
But this then also drops/ignores the requirement for rownames uniqueness. When aggregating, rownames are added back, and are unique (by definition). Why drop this for the 1st assays? What noise? This would require much more thorough testing.
I don't like quants, because it refers to any level; fine with psms or precursors or precs. We should remain domain specific, so that users can relate to the more abstract data structures.
The readQFeatures()
now allows to read data at any level, right? For the single-set use case (see 1. in your comment here https://github.com/rformassspectrometry/QFeatures/pull/200#issuecomment-1986862723), the data could be psm, peptide or protein. But I agree that quants
may sound too generic.
But this then also drops/ignores the requirement for rownames uniqueness. When aggregating, rownames are added back, and are unique (by definition). Why drop this for the 1st assays? What noise? This would require much more thorough testing.
The rownames remain unique and are defined by a unique index number as it has been the case for readQFeatures()
before the refactoring. In other words, I now dropped the PSM
part of the naming, but I still enforce that the rownames are unique (see here https://github.com/cvanderaa/QFeatures/blob/253c81e4ef9411719713e1373936619fc07b3e62/R/QFeatures-constructors.R#L252). I think it brings noise/confusion in our current implementation as the feature may now not only be PSM but also peptides or proteins.
Re rownames, I think we should avoid (for now at least, just before a release) situations like this (modified from ?QFeatures
), even if it's a valid object.
> m1 <- matrix(1:40, ncol = 4)
> m2 <- matrix(1:16, ncol = 4)
> sample_names <- paste0("S", 1:4)
> colnames(m1) <- colnames(m2) <- sample_names
> ## rownames(m1) <- c("a", letters[1:9]) ## NO ROW NAMES
> rownames(m2) <- letters[1:4]
> df1 <- DataFrame(Fa = 1:10, Fb = letters[1:10],
+ row.names = rownames(m1))
> df2 <- DataFrame(row.names = rownames(m2))
> (se1 <- SummarizedExperiment(m1, df1))
class: SummarizedExperiment
dim: 10 4
metadata(0):
assays(1): ''
rownames: NULL
rowData names(2): Fa Fb
colnames(4): S1 S2 S3 S4
colData names(0):
> (se2 <- SummarizedExperiment(m2, df2))
class: SummarizedExperiment
dim: 4 4
metadata(0):
assays(1): ''
rownames(4): a b c d
rowData names(0):
colnames(4): S1 S2 S3 S4
colData names(0):
> cd <- DataFrame(Var1 = rnorm(4),
+ Var2 = LETTERS[1:4],
+ row.names = sample_names)
> el <- list(assay1 = se1, assay2 = se2)
> fts1 <- QFeatures(el, colData = cd)
> rownames(fts1)
CharacterList of length 2
[["assay1"]] character(0)
[["assay2"]] a b c d
cf #200
I took here a few liberties:
paste0("PSM", seq_len(nrow(assayData))
, this was ported from readSCP() and brings more noise (cf point above) than benefits.sep
argument for creating sample names, it unecessraily complicates the readQFeatures interface. I implemented it in readSCP for LFQ experiment, but sample naming is now smartly handled to adapt to LFQ designs (i.e. when only 1quantCols
).sep
, the functionality to provide suffix for sample naming is error prone in case colAnnation is reordered. I removed it and let the function smartly generate sample names.NA
. The corresponding samples can be removed using subsequently usingsubsetByColData()
What is still misssing
readQFeatures
, but we could also unit test the new internal functions for safetyassayData
tables