rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/
25 stars 7 forks source link

Issue199 #202

Closed cvanderaa closed 8 months ago

cvanderaa commented 8 months ago

cf #200

I took here a few liberties:

What is still misssing

lgatto commented 8 months ago

names = "psms" is not a good idea (eg "precursor" is more relevant to DIA), I suggest to use "quants" instead, but I'm open to discussion.

I dropped automatic naming of the rownames as paste0("PSM", seq_len(nrow(assayData)), this was ported from readSCP() and brings more noise (cf point above) than benefits.

But this then also drops/ignores the requirement for rownames uniqueness. When aggregating, rownames are added back, and are unique (by definition). Why drop this for the 1st assays? What noise? This would require much more thorough testing.

cvanderaa commented 8 months ago

I don't like quants, because it refers to any level; fine with psms or precursors or precs. We should remain domain specific, so that users can relate to the more abstract data structures.

The readQFeatures() now allows to read data at any level, right? For the single-set use case (see 1. in your comment here https://github.com/rformassspectrometry/QFeatures/pull/200#issuecomment-1986862723), the data could be psm, peptide or protein. But I agree that quants may sound too generic.

But this then also drops/ignores the requirement for rownames uniqueness. When aggregating, rownames are added back, and are unique (by definition). Why drop this for the 1st assays? What noise? This would require much more thorough testing.

The rownames remain unique and are defined by a unique index number as it has been the case for readQFeatures() before the refactoring. In other words, I now dropped the PSM part of the naming, but I still enforce that the rownames are unique (see here https://github.com/cvanderaa/QFeatures/blob/253c81e4ef9411719713e1373936619fc07b3e62/R/QFeatures-constructors.R#L252). I think it brings noise/confusion in our current implementation as the feature may now not only be PSM but also peptides or proteins.

lgatto commented 8 months ago

Re rownames, I think we should avoid (for now at least, just before a release) situations like this (modified from ?QFeatures), even if it's a valid object.

> m1 <- matrix(1:40, ncol = 4)
> m2 <- matrix(1:16, ncol = 4)
> sample_names <- paste0("S", 1:4)
> colnames(m1) <- colnames(m2) <- sample_names
> ## rownames(m1) <- c("a", letters[1:9]) ## NO ROW NAMES 
> rownames(m2) <- letters[1:4]
> df1 <- DataFrame(Fa = 1:10, Fb = letters[1:10],
+ row.names = rownames(m1))
> df2 <- DataFrame(row.names = rownames(m2))
> (se1 <- SummarizedExperiment(m1, df1))
class: SummarizedExperiment 
dim: 10 4 
metadata(0):
assays(1): ''
rownames: NULL
rowData names(2): Fa Fb
colnames(4): S1 S2 S3 S4
colData names(0):
> (se2 <- SummarizedExperiment(m2, df2))
class: SummarizedExperiment 
dim: 4 4 
metadata(0):
assays(1): ''
rownames(4): a b c d
rowData names(0):
colnames(4): S1 S2 S3 S4
colData names(0):
> cd <- DataFrame(Var1 = rnorm(4),
+ Var2 = LETTERS[1:4],
+ row.names = sample_names)
> el <- list(assay1 = se1, assay2 = se2)
> fts1 <- QFeatures(el, colData = cd)  
> rownames(fts1)
CharacterList of length 2
[["assay1"]] character(0)
[["assay2"]] a b c d
cvanderaa commented 8 months ago

Ok I see what you mean, but this situation should never happen with readQFeatures() since regardless of the use case, the function always calls readSummarizedExperiment() which always assigns rownames (see here)