Issue199 - Githubissues

cvanderaa commented 8 months ago

cf #200

I took here a few liberties:

names = "psms" is not a good idea (eg "precursor" is more relevant to DIA), I suggest to use "quants" instead, but I'm open to discussion.
I dropped automatic naming of the rownames as paste0("PSM", seq_len(nrow(assayData)), this was ported from readSCP() and brings more noise (cf point above) than benefits.
I removed sep argument for creating sample names, it unecessraily complicates the readQFeatures interface. I implemented it in readSCP for LFQ experiment, but sample naming is now smartly handled to adapt to LFQ designs (i.e. when only 1 quantCols).
Similarly to sep, the functionality to provide suffix for sample naming is error prone in case colAnnation is reordered. I removed it and let the function smartly generate sample names.
I no longer removing feature for which there is no sample annotation. Instead, missing sample annotations are filled with NA. The corresponding samples can be removed using subsequently using subsetByColData()

What is still misssing

I thoroughly unit tested readQFeatures, but we could also unit test the new internal functions for safety
Documentation (man + vignette) still needs to be adapted
readQFeaturesFromDIANN still needs to be refactored, that is create a function that converts DIA-NN report tables in the expected assayData tables

lgatto commented 8 months ago

names = "psms" is not a good idea (eg "precursor" is more relevant to DIA), I suggest to use "quants" instead, but I'm open to discussion.

I don't like quants, because it refers to any level; fine with psms or precursors or precs. We should remain domain specific, so that users can relate to the more abstract data structures.

I dropped automatic naming of the rownames as paste0("PSM", seq_len(nrow(assayData)), this was ported from readSCP() and brings more noise (cf point above) than benefits.

But this then also drops/ignores the requirement for rownames uniqueness. When aggregating, rownames are added back, and are unique (by definition). Why drop this for the 1st assays? What noise? This would require much more thorough testing.

cvanderaa commented 8 months ago

I don't like quants, because it refers to any level; fine with psms or precursors or precs. We should remain domain specific, so that users can relate to the more abstract data structures.

The readQFeatures() now allows to read data at any level, right? For the single-set use case (see 1. in your comment here https://github.com/rformassspectrometry/QFeatures/pull/200#issuecomment-1986862723), the data could be psm, peptide or protein. But I agree that quants may sound too generic.

But this then also drops/ignores the requirement for rownames uniqueness. When aggregating, rownames are added back, and are unique (by definition). Why drop this for the 1st assays? What noise? This would require much more thorough testing.

The rownames remain unique and are defined by a unique index number as it has been the case for readQFeatures() before the refactoring. In other words, I now dropped the PSM part of the naming, but I still enforce that the rownames are unique (see here https://github.com/cvanderaa/QFeatures/blob/253c81e4ef9411719713e1373936619fc07b3e62/R/QFeatures-constructors.R#L252). I think it brings noise/confusion in our current implementation as the feature may now not only be PSM but also peptides or proteins.

lgatto commented 8 months ago

Re rownames, I think we should avoid (for now at least, just before a release) situations like this (modified from ?QFeatures), even if it's a valid object.

> m1 <- matrix(1:40, ncol = 4)
> m2 <- matrix(1:16, ncol = 4)
> sample_names <- paste0("S", 1:4)
> colnames(m1) <- colnames(m2) <- sample_names
> ## rownames(m1) <- c("a", letters[1:9]) ## NO ROW NAMES 
> rownames(m2) <- letters[1:4]
> df1 <- DataFrame(Fa = 1:10, Fb = letters[1:10],
+ row.names = rownames(m1))
> df2 <- DataFrame(row.names = rownames(m2))
> (se1 <- SummarizedExperiment(m1, df1))
class: SummarizedExperiment 
dim: 10 4 
metadata(0):
assays(1): ''
rownames: NULL
rowData names(2): Fa Fb
colnames(4): S1 S2 S3 S4
colData names(0):
> (se2 <- SummarizedExperiment(m2, df2))
class: SummarizedExperiment 
dim: 4 4 
metadata(0):
assays(1): ''
rownames(4): a b c d
rowData names(0):
colnames(4): S1 S2 S3 S4
colData names(0):
> cd <- DataFrame(Var1 = rnorm(4),
+ Var2 = LETTERS[1:4],
+ row.names = sample_names)
> el <- list(assay1 = se1, assay2 = se2)
> fts1 <- QFeatures(el, colData = cd)  
> rownames(fts1)
CharacterList of length 2
[["assay1"]] character(0)
[["assay2"]] a b c d

cvanderaa commented 8 months ago

Ok I see what you mean, but this situation should never happen with readQFeatures() since regardless of the use case, the function always calls readSummarizedExperiment() which always assigns rownames (see here)

rformassspectrometry / QFeatures

Issue199 #202

What is still misssing