Closed lgatto closed 8 months ago
After careful inspection of the new code I would like to rediscuss the some of the arguments (colAnnot
, ecols
, channelCol
and batchCol
). Currently, the documentation describes 2 cases for readQFeatures()
:
The user will use the following signature if they provide ecols
(backward compatibility):
c("readQFeatures", c("data.frame", "missing"))
or better
c("readQFeatures", c("data.frame", "vector"))
Problem: users cannot provide a sample annotation table in the single-set case.
The user will use the following signature
c("readQFeatures", c("data.frame", "data.frame"))
Problem: "lazy" users that don't have sample annotations are stuck. One argument could be that we promote good practice and enforce users to make a sample annotation table, but then, we must enforce it as well in the single-set case. Also, it sounds a bit confusing that colAnnot
can refer to a table of sample annotations or a vector of column name/indices, which are different concepts to me.
colAnnot
should always be a table (data.frame and friends), but the argument is optional.ecols
(single-set case) and channelCol
(multiset case) as a single argument (that I will tentatively call ecols2
for clarity).
ecols2
is a vector and colAnnot
is missing: use it to select quantitative columns in assayData
(same as ecols
)ecols2
is a vector and colAnnot
is provided: use it to select quantitative columns in assayData
(same as ecols
)ecols2
is a character(1)
and colAnnot
is provided: use it to select the quantitative column in assayData
(same as ecols
). If no match, use ecols2
to match a column in colAnnot
that provide the names of the quantitative columns (same as channelCol
)ecols2
can never be missing!batchCol
as follows:
batchCol
is a vector (stopifnot(length(batchCol) == nrow(assayData))
), use that vector to split the SE. batchCol
is a character(1)
and colAnnot
is provided, use it to retrieve the sample annotation column to use for splitting the SE.batchCol
is missing, don't split (=single-set case).I think implementing this will enable more flexible use cases, without a loss of the current capabilities.
QFeatures
object with or without colData
.colAnnot
table. This could facilitate reading LFQ data where a channelCol
makes no sense. I will implement this, so to show my ideas in action, but I'm happy to already hear your thoughts on that.
I've addressed all comments and changes. Any idea what will break once this is merged? What about vignettes in
SCP.replication
?
Since the behavior of the functionality did not change and there is backward compatibility, I don't think anything will break as long as we adapt readSCP()
as a wrapper function as you suggested in #199
Since the behavior of the functionality did not change and there is backward compatibility, I don't think anything will break as long as we adapt readSCP() as a wrapper function as you suggested in #199.
But some arguments have changed (for now at least), such as batchCol to runCol, which will lead to errors.
There are some aspects I like, others that I'm unsure to understand. But here are a first set fo comments, and I'll think more about your suggestion.
Problem: users cannot provide a sample annotation table in the single-set case.
Yes, they can, by setting a colAnnotation that defines a single run/batch. See the 'Single-set case using a data.frame' example.
Also, it sounds a bit confusing that colAnnot can refer to a table of sample annotations or a vector of column name/indices, which are different concepts to me.
These are two different ways to indicate annotate columns, so not
entirely different, but still confusing I agree. My thinking was to
steer away from ecol
and promote its replacement by a
colAnnotation
table.
Your suggestion is the opposite, i.e. to always use ecol
, and only
use colAnnotation
table to add a colData
to the final QFeatures
.
In your case 2,
- When
ecols2
is a vector andcolAnnot
is provided: use it to select quantitative columns inassayData
(same asecols
)
I assume here colAnnot
is used to populate the colData
of the
returned QFeatures
object?
And what about the following, that simplifies you suggestion (I think). Here the user uses either a colAnnotation data.frame, or one or two arguments.
The user must always provide the ecol
vector with variables names
or indices or a logical to identify what columns are quantitative
data. This defines de vertical splitting of the assay and the row
data.
If only ecol
of provided, and the second vector runCol
is
missing (or NULL
), we are in the single-set scenario.
If runCol
is also provided, it must be a numeric(1)
or a
character(1)
that gives the column name that defines the
horizontal splitting. The presence of runCol
defines the multi-set
case.
When colAnnotation
is present, ecol
and runCol
must either be
missing (or both NULL
). colAnnotation
must be a data.frame
or
an object that can be coerced to a data.frame
.
colAnnotation
must at least contain a column names ecol
that
with the names of the samples, i.e. table column names used for
vertical splitting of the data. This is equivalent of using the
ecol
argument, except it is passed in a data.frame
.
colAnnotation
can also contain a runCol
column that defines the
horizontal splitting of the input table.
If colAnnotation
only contains ecol
, we are in the single-set
case, and nrow(colAnnotation)
corresponds to the number of
samples.
If colAnnotation
contains ecol
and runCol
, we are in the
multi-set case, and nrow(colAnnotation) == nrow(table)
. This
situation requires a bit more work from the user, but that work
would anyway have to be done to provide additional sample
annotations (see next point). In addition, this is also something a
GUI from scpGUI
could help with.
Any other columns of colAnnotation
will become colData
variables.
Notes:
NULL
instead of missing. Will try to dig it up.ecol
to sampleCol
, to match it with runCol
.I really like your suggestion and how to present the problem as vertical and horizontal splitting.
I'm however skeptic with the following:
If colAnnotation contains ecol and runCol, we are in the multi-set case, and nrow(colAnnotation) == nrow(table). This situation requires a bit more work from the user, but that work would anyway have to be done to provide additional sample annotations (see next point). In addition, this is also something a GUI from scpGUI could help with.
For SCP data for instance, nrow(table)
can exceed 1,000,000 lines meaning that the colAnnot
lines will need to be duplicated throusands of times.
Instead, I would suggest to keep nrow(colAnnot
) equal to the number of samples in both cases (ie with or without a runCol
column). When colAnnot
contains a runCol
column, use that column to retrieve the run identifiers and (thanks to an efficient internal function), match the column in assayData
that contains the same run identifiers. This will require a bit more functionality to perform the matching of columns, but the task should be trivial.
What do you think?
And at this point, we could rename ecol to sampleCol, to match it with runCol.
Yes I prefer sampleCol
, but what about quantCol
? Again, no strong opinion on this, so I let you decide.
I was wrong, nrow(colAnnotation)
would need to be equal to number of samples time number of runs. Is that what you meant? In other words, in that case, colAnnotation
would be the final colData
.
Yes I prefer
sampleCol
, but what aboutquantCol
? Again, no strong opinion on this, so I let you decide.
Whichever we choose, it should be plural.
quantCols
only. Also LF with a re-ordered peptide/protein-level table (runs are missing in this case).|------+------------+-----------|
| cols | Quant 1..N | more cols |
| | | |
| | | |
| | | |
|------+------------+-----------|
Multi-set case, multiplexed: requires quantCols
and runCol
.
|-----+------+------------+-----------|
| Run | cols | Quant 1..N | more cols |
| 1 | | | |
| 1 | | | |
|-----+------+------------+-----------|
| 2 | | | |
|-----+------+------------+-----------|
Multi-set case, LF: requires quantCols
and runCol
with a optional multiPlexCol (for plexDIA).
|-----+------+---------+-----------+-------|
| Run | cols | Quant 1 | more cols | multi |
| 1 | | | | |
| 1 | | | | |
|-----+------+---------+-----------+-------|
| 2 | | | | |
|-----+------+---------+-----------+-------|
Special case DIANN. A specialised function that parses the table to case 2.
Users can either use the arguments above or a colAnnotation
data.frame (that will become the colData
).
I'll work tonight on refactoring readQFeatures()
to accommodate for the new arguments and behaviour. I will provide the changes in a new PR.
I've addressed all comments and changes. Any idea what will break once this is merged? What about vignettes in
SCP.replication
?